A Fourier analysis of symmetry in protein structure

William R. Taylor,1, Jaap Heringa, Franck Baud2,3 and Tomas P. Flores,4

Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, UK and 2 Mathematics Department, Stanford University, Palo Alto, CA, USA


    Abstract
 
The score matrix from a structure comparison program (SAP) was used to search for repeated structures using a Fourier analysis. When tested with artificial data, a simple Fourier transform of the smoothed matrix provided a clear signal of the repeat periodicity that could be used to extract the repeating units with the SAP program. The strength of the Fourier signal was calibrated against the signal from model proteins. The most useful of these was the novel random-walk approach employed to generate realistic ‘fake’ structures. On the basis of these it was possible to conclude that only a small proportion of protein structures have an unexpected degree of symmetry. Artificially generated ‘ideal’ folds provided an upper limit on the strength of signal that could be expected from a ‘perfectly’ repeating compact structure. Unexpectedly, some of the very regular β-propellor folds attained the same strength but the majority of symmetric structures lay below this region. When native proteins were ranked by the power of their spectrum, a wide variety of fold types was seen to score highly. In the βα class, these included the globular βα proteins and the more repetitive leucine-rich βα folds. In the all-β class, β-propellors, β-prisms and β-helices were found as well as the more globular γ-crystallin domains. When this ranked list was filtered to remove proteins that contained detectable internal sequence similarity (using the program REPRO), the list became composed almost exclusively of globular βα class proteins and, in the top 50 re-ranked proteins, only a single 4-fold propellor structure remained.

Keywords: protein structure repeats/Fourier transform symmetry


    Introduction
 
It has been apparent since the earliest protein structures were solved that some structures contain repeated substructural elements (Rao and Rossmann, 1973), often to a remarkable degree (Banner et al., 1975). Over the years, these substructures, which are often referred to as super-secondary structure, have been extensively studied. Similarly, larger repeated structural components (domains) have been analysed (Heringa and Taylor, 1997). Most analysis has taken the approach of dissecting out and variously classifying and counting the different substructures (Flores et al., 1993; Salem et al., 1999) or larger assemblies (Blundell and Srinivasan, 1996). While this gives a good qualitative overview, it is difficult, using this approach, to get quantitative values that can be compared across structures with different super-secondary structures and across the range of size from secondary structure to domains.

The equivalent problem in the one-dimensional (1D) world of the sequence has been approached both by direct internal comparisons (Heringa and Argos, 1993) and by Fourier transform (McLachlan and Stewart, 1977), which is particularly suitable for the sequences of long fibrous proteins (McLachlan, 1977, 1983). The former approach can be directly transposed to 3D structures simply by repeated superposition of fragments (McLachlan, 1979; Matthews and Rossmann, 1985). However, while the Fourier transform is a natural tool to adopt for the analysis of repetition in the 1D sequence, its extension to structural data is not straightforward. In Cartesian coordinates (with each atomic point modelled by a Gaussian density) the 3D Fourier transform cannot be interpreted easily. Attempts have also been made using spherical coordinates (harmonics) (Duncan and Olson, 1993), but these addressed protein topography and not topology.

In this work we describe a Fourier analysis at an intermediate, 2D level based on the structural similarity matrices used in protein structure comparison. In this representation, symmetric1 structures appear as off-diagonal ‘ridges’ of high score (as in the sequence ‘dot-plot’) and the periodicity of these ridges can be analysed by the Fourier transform. To allow for the quantitative interpretation of the resulting spectra, we also describe a normalization method based on ‘random’ structures.


    Methods
 
Protein structure comparison

The SAP program (Taylor, 1999) was used to calculate the similarity between two proteins. This method uses the double dynamic programming (DDP) algorithm (Taylor and Orengo, 1989). In the SAP program, a pair of positions (one residue from each structure) is compared by matching their interatomic vectors to all other residues in their respective structures. This is achieved by constructing a matrix, defined by the sequence of one structure against the other, which is composed of a measure based on relative geometry of the pair of pairs (Figure 1) (Taylor, 1999). In structure comparison, this matrix (referred to as the ‘low-level’ matrix) provides the base from which an alignment is extracted (using a standard dynamic programming algorithm). The alignments over all pairs of residues (between the two structures) are summed, forming another matrix (the ‘high-level’ matrix) from which the final alignment is extracted.



 
Fig. 1. Outline double dynamic programming algorithm. Values from the HIGH LEVEL score matrix are ranked and a prespecified number (represented by the dashed cutoff line) are passed to the LOW LEVEL for evaluation. In the current application these are perturbed at this point by a random displacement before sorting. At the LOW LEVEL, the best path is found that passes through the selected pair and the resulting alignment paths are summed back into the HIGH LEVEL score matrix. After normalization of the values, this matrix is used for the Fourier transform. (For structure comparison, the cycle is normally iterated five times.)

 
In the SAP program, a small number of selections are made and these are then increased over successive iterations. For the current application, there was no iteration and the number of initial selections was correspondingly increased to equal the length of the protein (equivalent to the number in the final cycle of the iterated version). To prevent the selections from clustering on any strongly similar local sub-structures, a random perturbation factor was included in the choice of selected pairs. At the end of the comparison calculation, the high-level score matrix was normalized to attenuate elements with a value >3 SD (as calculated over the whole matrix).

Smoothing the score matrix

In a protein structure that contains repeated sub-structures, these will form diagonal ridges across the matrix parallel to the main diagonal (i = j) and it is the strength and periodicity of these undulations that can be analysed using a Fourier analysis. However, it was found that often these ridges are sharply defined and the resulting spikes or peaks generated a high ‘background’ level in the Fourier transform spectrum. (It should be remembered that the transform of a ‘spike’ gives rise to an equal score across the full frequency range.)

To make the frequency spectrum easier to interpret, the high-level matrix was smoothed. Simple averaging was avoided as this tended to eliminate locally strong similarities. (For example, the peak from two corresponding secondary structures would be averaged with the weak signals from their flanking coil regions.) Instead the ‘averaging’ was made by taking the maximum over adjacent values—equivalent to a dynamic programming algorithm, as follows:


(1)

The factor of a half applied to off-diagonal contributions is equivalent to a gap-penalty and dampens smoothing in the anti-diagonal direction while still allowing the propagation of some signal across small insertions.

This operation was applied to each cell in the matrix A (with increasing i and j) creating the matrix B. As B will be asymmetric, the smoothing operation was then applied to B with decreasing i and j, recreating a new matrix A. This flip-flop smoothing was repeated 20 times, resulting in a matrix that still contained substantial detail. The effect of this repetition can be pictured by considering a single diagonal line (i = j + n) in the matrix, which for half its length has a value 1 and is otherwise 0. After 20 iterations, the sharp edge will be smoothed to a sigmoid curve which makes most of the transition from 1 to 0 over a range equal to the number of smoothing iterations. More generally, a single point will be smoothed into a Gaussian (bell-shaped) curve of the form $\exp(-x^{2}/n)$, where n is the number of iterations. By analogy with optics, this means that two ‘spikes’ closer than 6.5 residues will not be resolved. In terms of protein structure, short-range features will be smoothed away, but typical secondary structures will remain distinct. It should be remembered, however, that this smoothing takes place after comparison by the SAP algorithm, so all features, whatever their size, are fully considered in establishing internal similarities.
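The flip-flop scheme can be illustrated with a minimal sketch. Equation 1 is not reproduced in this text, so the update rule used here is an assumption chosen only to match the verbal description: each cell becomes the mean of its own value and the largest of its preceding diagonal neighbour and its two half-weighted off-diagonal neighbours. The function names `smooth_pass` and `flip_flop` are hypothetical, and neighbours are read from the input matrix rather than from the matrix being built.

```python
def smooth_pass(a, forward=True):
    """One smoothing pass over a square matrix held as a list of lists.

    ASSUMED rule (not taken from the paper's Equation 1):
        b[i][j] = (a[i][j] + max(diag, up / 2, left / 2)) / 2
    where the neighbours lie at lower indices for a forward pass and at
    higher indices for a backward pass.
    """
    n = len(a)
    step = 1 if forward else -1
    b = [row[:] for row in a]
    for i in range(n):
        for j in range(n):
            pi, pj = i - step, j - step
            row_ok = 0 <= pi < n
            col_ok = 0 <= pj < n
            diag = a[pi][pj] if row_ok and col_ok else 0.0
            up = a[pi][j] if row_ok else 0.0
            left = a[i][pj] if col_ok else 0.0
            # Half weight on off-diagonal moves acts like a gap penalty.
            best = max(diag, 0.5 * up, 0.5 * left)
            b[i][j] = 0.5 * (a[i][j] + best)
    return b


def flip_flop(a, iterations=20):
    """Alternate forward and backward passes, as described in the text."""
    for _ in range(iterations):
        b = smooth_pass(a, forward=True)
        a = smooth_pass(b, forward=False)
    return a


if __name__ == "__main__":
    size = 9
    spike = [[0.0] * size for _ in range(size)]
    spike[4][4] = 1.0
    smoothed = flip_flop(spike, iterations=1)
    for row in smoothed:
        print(" ".join(f"{v:5.3f}" for v in row))
```

Under this assumed rule a single spike spreads preferentially along the diagonal and only weakly in the antidiagonal direction, which is the qualitative behaviour the text attributes to the gap-penalty-like factor of a half.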

Fourier transform

The remaining problem with the data generated as described above is the direction in which to calculate the Fourier transform. As the matrix is symmetric, this leaves a choice between the direction parallel to an axis or parallel to the minor diagonal (i + j = N, where N is the length of the protein, referred to below as the antidiagonal). The former has the problem that the diagonal itself constitutes an anomaly in the scores since the similarity of a residue to itself (plus environment) is very high. In the latter direction, however, only one line covers the full length of the protein (the antidiagonal itself).

Fourier calculation. As proteins are short and computers are fast, the fast Fourier transform (FFT) algorithm was not used, as this requires special treatment of the signal to render it in lengths of powers of two (Press et al., 1986). Instead, the direct approach was used of multiplying sine and cosine waves over a range of frequencies and plotting the power of each component as a spectrum. Each term in the spectrum was normalized for the length of the protein (N) as: $100(s^{2} + c^{2})^{1/2}/N$, where s and c are the sine and cosine (imaginary and real) components of the transform. (See Figure 2 for examples and a comparison to the FFT method.)
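The direct calculation can be sketched as follows for a single row of scores. This is an illustrative reading of the normalization quoted above, not the authors' program; the function name `power_spectrum` and the choice to scan frequencies from 1 to N/2 are assumptions.

```python
import math


def power_spectrum(signal, max_freq=None):
    """Direct (non-FFT) transform of one row of scores.

    For each whole-number frequency k, the signal is multiplied by a sine
    and a cosine wave with k cycles over its length, and the term
    100 * sqrt(s^2 + c^2) / N is recorded.
    """
    n = len(signal)
    if max_freq is None:
        max_freq = n // 2  # assumed upper limit of the scan
    spectrum = []
    for k in range(1, max_freq + 1):
        s = sum(v * math.sin(2.0 * math.pi * k * t / n)
                for t, v in enumerate(signal))
        c = sum(v * math.cos(2.0 * math.pi * k * t / n)
                for t, v in enumerate(signal))
        spectrum.append(100.0 * math.sqrt(s * s + c * c) / n)
    return spectrum


if __name__ == "__main__":
    # A toy 'row' of 100 residues carrying exactly five undulations.
    row = [math.cos(2.0 * math.pi * 5 * t / 100) for t in range(100)]
    spec = power_spectrum(row)
    peak = max(range(len(spec)), key=spec.__getitem__) + 1
    print("dominant frequency:", peak)
```

Because no power-of-two length is required, the same function can be applied to a row of any length, at a cost that grows with the square of the row length.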




 
Fig. 2. Example Fourier transform power spectra. The power spectra are plotted for the middle five rows from the score matrices of two proteins: (a) the small βα protein cheY (3chy, 128 residues) and (b) the last nine repeats of the ribonuclease inhibitor (2bnh, 256 residues) (Figure 4). 3chy has five βα-units while the 2bnh fragment has a clear 9-fold repeat. The latter repeats were strong enough to give rise to harmonics (peaks at multiples of 9). The spectra are plotted for the current implementation (full) and also for the FFT algorithm (dashed) using the routine realft from the Numerical Recipes collection (Press et al., 1986). The choice of proteins with lengths equal to powers of 2 allowed the unbiased comparison of the two algorithms.

 
The Fourier method detects a signal from the relative spacing of features, and not from their number. In these spectra, a peak at 5 therefore means that ridges occur in the score plot with a spacing of 1/5 the protein length, but there are not necessarily five of them. In addition, the size of the components giving rise to the repeat is not easily found in absolute terms, as the same period will contain differing numbers of residues in proteins of different sizes, while the substructures might be the same size but have longer ‘loops’ between them.


For the calculation of the transform in the antidiagonal direction, the use of the antidiagonal alone may be unrepresentative and a band of 20 antidiagonals was taken. Such a construction, however, creates outer edges that are shorter than the antidiagonal and to correct this, the ‘missing’ corners were padded by repeated reflection of the diagonal at each level.

For the calculation in the direction of the rows, two variants were tried: firstly, with the untouched score matrix and secondly, with the row shifted by one in each column and wrapped so as to shift the diagonal to the edge.
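The shifted variant can be pictured as a cyclic rotation of each row so that the main diagonal lands in the first column. The sketch below is one way to express that, under the assumption that row i is rotated by i positions with wrap-around; the function name `shift_diagonal_to_edge` is hypothetical.

```python
def shift_diagonal_to_edge(matrix):
    """Rotate row i left by i positions (with wrap-around).

    After the shift, element (i, i) sits in column 0, and a ridge that
    was d residues off the diagonal sits in column d.
    """
    n = len(matrix)
    return [[matrix[i][(i + k) % n] for k in range(n)] for i in range(n)]


if __name__ == "__main__":
    n = 6
    identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for row in shift_diagonal_to_edge(identity):
        print(row)
```

In this layout the diagonal column can simply be skipped before transforming each row.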

Variance calculation. Each row in the matrix (or each antidiagonal) was individually transformed and the spectra averaged (rather than transforming an average signal). This approach has the advantage that the variance of each component can then be calculated as a guide to significance of the peak and potentially used as a weighting factor.
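The difference between averaging spectra and transforming an averaged signal can be shown with a small sketch. The helper names `row_spectrum` and `mean_and_variance` are hypothetical, and the toy matrix (one wave per row, shifted by one position in each successive row) is an assumed stand-in for a score matrix with diagonal ridges.

```python
import math


def row_spectrum(row, max_freq):
    """Direct sine/cosine transform of one row, normalized as 100*sqrt(s^2+c^2)/N."""
    n = len(row)
    out = []
    for k in range(1, max_freq + 1):
        s = sum(v * math.sin(2.0 * math.pi * k * t / n) for t, v in enumerate(row))
        c = sum(v * math.cos(2.0 * math.pi * k * t / n) for t, v in enumerate(row))
        out.append(100.0 * math.sqrt(s * s + c * c) / n)
    return out


def mean_and_variance(matrix, max_freq):
    """Transform every row separately, then average the spectra.

    Returns the per-frequency mean and (population) variance.
    """
    spectra = [row_spectrum(row, max_freq) for row in matrix]
    m = len(spectra)
    mean = [sum(sp[k] for sp in spectra) / m for k in range(max_freq)]
    var = [sum((sp[k] - mean[k]) ** 2 for sp in spectra) / m
           for k in range(max_freq)]
    return mean, var


if __name__ == "__main__":
    n = 32
    # Each row carries the same period-4 wave, shifted along with the diagonal.
    matrix = [[math.cos(2.0 * math.pi * 4 * (t - i) / n) for t in range(n)]
              for i in range(n)]
    mean, var = mean_and_variance(matrix, n // 2)
    averaged_rows = [sum(matrix[i][t] for i in range(n)) / n for t in range(n)]
    print("mean of spectra at k=4:    ", round(mean[3], 3))
    print("spectrum of mean row at k=4:", round(row_spectrum(averaged_rows, n // 2)[3], 3))
```

In this toy case the phase of the wave differs from row to row, so averaging the rows first cancels the signal, whereas averaging the individual spectra retains it and also yields a variance for each component.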


    Random models
 
To assess the significance of the peaks in the spectra derived from native proteins, a random protein model was required. This was obtained from two sources: one being a constrained random walk (which would be expected to have less symmetry than most ‘real’ proteins) while the other was generated from fold permutations over regular packed secondary structure elements (which would be expected to have greater symmetry than most ‘real’ proteins).

‘Random’ fake proteins. Random proteins have been generated many times in the past for testing purposes and a simple algorithm to do this is a self-avoiding random walk from one α-carbon to the next constrained to lie inside a sphere or ellipsoid (Cohen and Sternberg, 1980; Thornton and Sibanda, 1983). The algorithm employed below is similar but incorporates a local, rather than a global, constraint for the chain to be confined. This was implemented by selecting the next position in a growing chain to be preferentially in contact with its predecessors using the following algorithm (in C code):

The density of the ‘fake’ structures is dependent on the number of target neighbours (controlled by the variable n in the above code). If this is set to find too few contacts then the resulting structures are not sufficiently compact, while if set to find many neighbours, then the time taken increases. Trials indicated that aiming for four or more neighbours produced compact structures while aiming for only two neighbours was not sufficient (n = –1). However, as the program need be run only once to produce a databank of ‘fake’ structures, the denser models were chosen (n = 3).

The resulting structures are remarkably ‘life-like’ (Figure 3), even incorporating elements reminiscent of secondary structure and, if allowed to grow large enough, ‘breaking-up’ into domain-like regions (Figure 3c).
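The idea of a locally constrained self-avoiding walk can be illustrated with a short sketch. This is not the authors' C code, which is not reproduced in this text: the function names, the distances (3.8 Å virtual bond, 4.0 Å minimum separation, 8.0 Å contact radius), the number of trial positions and the one-step back-up on a dead end are all assumptions made for illustration, and the `target` parameter is only loosely analogous to the neighbour setting discussed above.

```python
import math
import random


def random_unit_vector(rng):
    """Uniformly distributed direction on the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(phi), r * math.sin(phi), z)


def grow_chain(length, target=4, trials=200, bond=3.8, clash=4.0,
               contact=8.0, seed=0):
    """Grow an alpha-carbon trace that prefers contact with earlier residues.

    Each new position is tried at a fixed bond length from the last atom.
    Trials that come closer than `clash` to any earlier atom are rejected;
    the first trial with at least `target` neighbours inside `contact` is
    accepted, otherwise the trial with the most neighbours is kept.
    """
    rng = random.Random(seed)
    chain = [(0.0, 0.0, 0.0)]
    while len(chain) < length:
        last = chain[-1]
        best, best_count = None, -1
        for _ in range(trials):
            u = random_unit_vector(rng)
            cand = tuple(last[k] + bond * u[k] for k in range(3))
            dists = [math.dist(cand, p) for p in chain[:-1]]
            if any(d < clash for d in dists):
                continue  # self-avoidance
            count = sum(1 for d in dists if d < contact)
            if count > best_count:
                best, best_count = cand, count
            if count >= target:
                break
        if best is None:
            chain.pop()  # dead end: back up one residue and try again
        else:
            chain.append(best)
    return chain


if __name__ == "__main__":
    trace = grow_chain(60, target=4, seed=1)
    centre = tuple(sum(p[k] for p in trace) / len(trace) for k in range(3))
    rg = math.sqrt(sum(math.dist(p, centre) ** 2 for p in trace) / len(trace))
    print(f"{len(trace)} residues, radius of gyration {rg:.1f} A")
```

Raising `target` makes each step search longer for a well-packed position, which mirrors the trade-off between compactness and run time noted in the text.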



 
Fig. 3. Fake model proteins. (a–d) Typical structures produced by the random-walk algorithm described in Methods for four different chain lengths. Secondary structure and domain-like components can be seen. The figures were produced by the program RASMOL and are coloured from blue (amino) to red (carboxy). (e and f) Proteins with ideal secondary structure packing were generated ranging from small structures with a five-stranded β-sheet to large structures with 13 strands in the sheet. The former is viewed across the sheet while the latter is viewed end-on to the sheet. This view reveals a particularly symmetric structure with each βα-unit ‘jumping’ to the right by two strands then ‘leap-frogging’ back towards the left. These models have the correct chirality of secondary and super-secondary structures with no loops crossing each other. Parallel connections between sequential secondary structures were only allowed on the edge of the sheet. The number of folds assessed for the largest structures was 345 million. Each βα-unit is coloured from blue (amino) to red (carboxy).

 
Regular fake proteins. The more regular class of ‘fake’ proteins was derived from ‘stick’ models of proteins, in which each secondary structure was represented by a simple line, following the models of Cohen et al. (Cohen et al., 1982). These were then ‘expanded’ into α-carbon models as described previously (Taylor, 1991). From a single basic ‘architecture’ of unconnected, undirected secondary structure line segments, many different folds can be generated by the combinatoric enumeration of all possible connections of the line segments. For the βα class models these were further constrained to incorporate only right-handed connections between the βαβ units with no connecting loops crossing. Typical examples are shown in Figure 3. In these models, unlike those described in the previous section, all α-helices have the correct chirality and β-sheets have the correct twist.


    Visualizing repeats
 
The power spectrum from a protein structure may indicate the period of the repeated structures but it does not provide an exact definition of their location or boundaries. These aspects must be extracted using more conventional structure comparison, but this can be guided by the Fourier analysis.

It might be thought that this information could be extracted using wavelet analysis; however, this technique can only be used where the feature is much smaller than the length of the signal (such as a short pattern of residues in a long protein sequence). In the current application the features constitute a large fraction of the length of the protein and, indeed, at this scale, wavelet and Fourier transform methods converge.

Biasing SAP to overlap repeats. In the SAP score matrix, each off-diagonal ‘ridge’ represents a region in which repeats can be aligned. The SAP comparison was focused on each solution in turn by convoluting a Gaussian (bell-shaped) function over the off-diagonal ridge in the score matrix. The width of the bell-curve was set as a function of the period of the repeat such that all neighbouring ridges were essentially eliminated, as follows:


(2)
where N is the length of the protein, r is the rth ridge away from the diagonal (r = 0) and p is the frequency of the repeat. Equating the above bell-curve with the normal distribution would give a SD of N/2p. This means that the neighbouring ridges are 2 SD away, at which point the value of the curve is: 0.0183—sufficiently small to prevent the alignment path from jumping between ridges.

In a protein containing five repeated sub-structures, A . . . E, the solution obtained with p = 5 and r = 3 is the alignment of sub-structures, A, B, C with C, D, E.
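The masking step can be sketched numerically. Equation 2 is not reproduced in this text, so the form used below, $\exp(-x^{2})$ with x the distance from the chosen ridge measured in units of N/2p, is an assumption; it was chosen because it reproduces the quoted value of 0.0183 at the neighbouring ridges. The function name `ridge_weight` is hypothetical.

```python
import math


def ridge_weight(i, j, n, p, r):
    """Bell-shaped weight centred on the r-th ridge from the diagonal.

    ASSUMED form: exp(-x^2), with x = (|i - j| - r*N/p) / (N / (2p)).
    n is the protein length and p the frequency of the repeat.
    """
    centre = r * n / p        # offset of the chosen ridge from the diagonal
    width = n / (2.0 * p)     # the 'SD' quoted in the text
    x = (abs(i - j) - centre) / width
    return math.exp(-x * x)


if __name__ == "__main__":
    n, p, r = 100, 5, 1
    on_ridge = ridge_weight(20, 0, n, p, r)         # |i - j| = N/p
    next_ridge = ridge_weight(40, 0, n, p, r)       # one period further out
    print(f"on ridge: {on_ridge:.4f}, neighbouring ridge: {next_ridge:.4f}")
```

The neighbouring ridge lies two width units from the centre, so its weight is exp(–4), or about 0.0183, in agreement with the figure given above.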

Delineating repeat boundaries. The same masking approach can be used to identify repeat boundaries by focusing on the main diagonal (r = 0) and extracting the alignment path, then cyclically shifting the original score matrix rows by one period (N/p) and repeating the alignment. After a full set of shifts, the points at which each of the p alignment paths crosses the edge of the matrix give the repeat boundaries.


    Results
 
Choice of transform direction

The highly repetitive protein structure of the ribonuclease inhibitor H1 (2bnh) was used to assess the two transform directions described above. This structure consists of 16 (leucine-rich) βα repeats arranged in a regular arc (Figure 4).



 
Fig. 4. Ribonuclease inhibitor protein. The leucine-rich repeats in the protein 2bnh were used to derive a series of truncated structures for testing the Fourier method. These either omitted the last strand, giving 16 βα-units, or omitted the first strand, giving 16 αβ-units. The figure was produced by the program RASMOL.

 
For speed in initial testing, only the C-terminal eight repeats of the 2bnh structure were used. With the antidiagonal variant, these produced a clear spectrum with a dominant peak at period 8 and further significant peaks visible out to the second harmonic (Figure 5).




 
Fig. 5. Antidiagonal Fourier transform. (a) A band of width N/2 (N = protein length) in the antidiagonal direction of the score matrix (black->white = high->low score) was transformed giving (b) the power spectra averaged over each antidiagonal plotted in (a). Error bars are drawn at ±σ (1 SD).

 
The row-based variant, as expected, had a very strong diagonal, and although the correct period was evident, the background level of the power spectrum was elevated by this sharp peak. To reduce the effect of the strong diagonal component, the level of sampling on the diagonal was reduced in the SAP algorithm by multiplying each term in the score matrix by successive random numbers (drawn evenly across the interval 0...1). With one random number, the diagonal was still dominant, but with three, it had been reduced to a level similar to the off-diagonal ridges (Figure 6). With more than three reduction steps, it began to disappear.






 
Fig. 6. Diagonal damping. The SAP score matrix is shown with differing degrees of sampling on the diagonal. The selection of diagonal pairs in SAP was reduced by successive multiplication by n random numbers. Plots of the C-terminal eight repeats of the protein 2bnh are shown for values of n = 0...3 (a–d). The highest values in the plots are black.

 
Using the product of three random numbers, the spectrum produced by the row-based method was ‘sharper’ than the antidiagonal variant, with smaller variance on the peaks, and harmonics visible out to the fourth (along with some sub-harmonics between the peaks). The shifted variant, in which the diagonal had been brought to an edge (and ignored), showed no improvement over the transform with the full row length.

With this treatment of the diagonal, the row-based approach appeared preferable to the antidiagonal approach as it permits the full data matrix to be used, allowing statistics to be calculated over a bigger sample than the limited band along the antidiagonal. The transform in the (unshifted) row direction was used to generate all further results presented below.

Analysis of total power

Model repeat proteins. The ribonuclease inhibitor structure 2bnh (Figure 4) was used to generate a double series of model structures. For one, the last strand was deleted, leaving exactly 16 βα-units. These were then successively deleted in turn from the N-terminus (being the least regular), giving 16 coordinate sets each with a different number of repeats. For the second series, this was repeated but with the first strand deleted, giving a series of 16 αβ-units.

The total power of the spectrum obtained when transforming each construct in these series of structures is plotted against the number of residues in the structure in Figure 8. The power rises quickly followed by a slower, linear increase. The combined trend can be reasonably modelled with the combination of a linear component and an inverted Gaussian function:




 
Fig. 8. Total power against protein length. (a) Model proteins, including: the deletion series for the 2bnh structure based on its βα-unit and αβ-unit (‘zig-zag’ lines) with a summary curve $N + 2400(1 - \exp(-3N^{2}/10^{4}))$ (dashed curve); random-walk (‘fake’) proteins (+s) with their upper boundary summarized by $2N + 1400(1 - \exp(-4N^{2}/10^{4}))$ (lower curve) based on a sample of 20 structures for each length (Figure 3). Ideal βα models are plotted as symbols, based on (βα)ₙβ with n = 4 (x), n = 6 (*), n = 8 (□), n = 10 (■), n = 12 (○). (b) Native protein data with the summary curves from the model data replotted for reference.

 

$$p = ax + b\left(1 - \exp(-cx^{2}/10^{4})\right) \qquad (3)$$
giving the total power (p) from the number of residues x with the coefficients a = 1, b = 2400 and c = 3.

Ideal βα proteins. Using the same βα-unit lengths as in 2bnh, model protein structures were generated with 4, 6, 8, 10 and 12 βα-units (plus a terminal β-strand) and a sample of 15 different folds was taken from each. These are plotted on Figure 8a where it can be seen that these structures have a very high power, with only the weaker members being comparable to the 2bnh series.

‘Random’ proteins. A series of random-walk (‘fake’) proteins from 50 to 450 residues was generated in steps of 50 residues. A sample of 20 from each length was transformed and the resulting power of the spectrum plotted in Figure 8a.

This series shows a slight dependence on length, with the power level rising from around 1200 (±200) for structures with 100 residues. The upper limits of this distribution can again be modelled by the same function as described above (Equation 3) with values for the coefficients a = 2, b = 1400 and c = 4. This curve provides a good base-level against which the power of ‘real’ proteins can be assessed.
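The two summary curves can be evaluated with a few lines of code. The sketch assumes that the fitted function is a linear term plus an inverted Gaussian, $p = ax + b(1 - \exp(-cx^{2}/10^{4}))$, which is a reading consistent with the quoted coefficients and with the large-N limit of 1400 + 2N used for the random baseline; the function names are hypothetical.

```python
import math


def model_power(x, a, b, c):
    """Linear term plus inverted Gaussian (assumed form of the fitted curve)."""
    return a * x + b * (1.0 - math.exp(-c * x * x / 1.0e4))


def repeat_summary(x):
    """Summary curve for the 2bnh deletion series (a = 1, b = 2400, c = 3)."""
    return model_power(x, 1.0, 2400.0, 3.0)


def random_upper(x):
    """Upper boundary of the random-walk proteins (a = 2, b = 1400, c = 4)."""
    return model_power(x, 2.0, 1400.0, 4.0)


def exceeds_random(total_power, n_residues):
    """True when a protein's total power lies above the random baseline."""
    return total_power > random_upper(n_residues)


if __name__ == "__main__":
    for n in (50, 100, 200, 400):
        print(n, round(random_upper(n)), round(repeat_summary(n)))
```

For chains longer than about 100 residues the exponential term is negligible, so the random baseline reduces to 1400 + 2N.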

‘Real’ proteins. The program was run on a selection of proteins from the PDB (as used previously by Jonassen et al., 2000) and the power of their spectra plotted against length (Figure 8b). The majority of proteins lie below the upper limits attained by the ‘fake’ proteins but a considerable number (17%) still lie above this line and even above the values attained by the series of 2bnh repeats (1%), with a few approaching the highest values seen for the ideal model proteins.

Inspection of the highest scoring structures revealed a collection of symmetric folds, dominated by the very regular β-propellor structures (Murzin, 1992) but also containing many regular βα-folds, including TIM barrels (Table I). These results can be summarized by plotting the total power for each protein along with its secondary structure composition, while colouring the protein as a function of the most dominant period in its spectrum (Figure 9).


 
Table I. Proteins ranked by symmetry
 


 
Fig. 9. Fourier results plotted with secondary structure composition. The percentage of β-structure (x-axis) and the percentage of α-structure (y-axis) lie in the plane of the paper with the total power of the frequency spectrum plotted out of the page (z-axis, shown as a stereo-pair). Each protein that had a power above random (1400 + 2N, where N is the number of residues) is plotted as a sphere coloured by the frequency of the highest peak in the spectrum (red, high frequency; blue, low frequency). The figures were produced by the program RASMOL.

 
In this plot, the ‘bright’ central region corresponds to the βα proteins (TIM barrels, Rossmann folds and leucine-rich repeats) while the bright edges are the more repetitive all-α and all-β folds.

Removing expected symmetries

Some of the proteins included in Table Ia contain internally duplicated domains that are apparent from sequence comparison. To access the more unexpected symmetries, these were filtered using the program REPRO (Heringa and Argos, 1993). This adjustment was made by down-weighting the normalized power (s) (column score in Table Ia) by the REPRO score (r) as: $t = s\,\exp(-r^{2}/10^{5})$. The new score t is typically s/1000 for the strongest sequence repeats (r = 831), falling to s/10 for clear repeats (r = 480) and 3s/4 for weak repeats (r = 170). These modified results are shown in Table Ib.
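The down-weighting can be checked against the three quoted REPRO scores with a short calculation. The function names `repro_factor` and `filtered_score` are hypothetical; the formula is the one given in the text.

```python
import math


def repro_factor(r):
    """Multiplier applied to the normalized power for a REPRO score r."""
    return math.exp(-r * r / 1.0e5)


def filtered_score(s, r):
    """Normalized power s down-weighted by the sequence-repeat score r."""
    return s * repro_factor(r)


if __name__ == "__main__":
    for r in (831, 480, 170, 0):
        print(f"r = {r:3d}  factor = {repro_factor(r):.4f}")
```

The factors for r = 831, 480 and 170 come out close to 1/1000, 1/10 and 3/4 respectively, matching the values stated above, and a protein with no detectable sequence repeat (r = 0) keeps its full score.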

The greatest change between the rankings is the disappearance of the highly repetitive folds seen at the top of Table Ia (β-propellors, β-prisms and βα-arc proteins). The single example of this type that holds its place in the top 50 is the 4-fold β-propellor 1gen and, although its internal repeat was recognized by REPRO, the sequence identity over the repeats is <20%. Those remaining in the filtered list (Table Ib) were overwhelmingly of the globular βα fold class and are dominated by the Rossmann-like folds (which contain a pseudo-2-fold) and the βα-barrel (TIM-like) fold, which has 8-fold cyclic symmetry (but also incorporates many deviations). The only globular βα class protein to drop markedly in the rankings was the von Willebrand factor protein 1atzA which, as well as a very symmetric fold (‘classic’ Rossmann fold2), has sufficient sequence similarity in the two halves for REPRO to pick up a repeat.

A fold that makes a stronger appearance in the filtered list is the αββα layer protein (ABBA in Table I) 1ryp1 from the proteasome, which contains an internal structural duplication. This symmetry runs through three of the four layers of secondary structure and, although it is not clear to the ‘eye’, it was identified as a 2-fold repeat in the Fourier spectrum (Table Ib). The sequence identity over the repeats is <10%, which would not be seen by any sequence-based method. The 14 related chains from the proteasome, labelled 1, 2 and A–K, and the related heat-shock protein structure 1dooA, all score above random and, by the scoring used in Table Ib, lie in ranked positions: 30, 86 (chains 1, 2); 122, 189, 265, 123, 148, 139, 71, 58, 77, 264, 127 (chains A–K) and 63 (1dooA).

The more obvious structural repeat seen in the double-domain γ-crystallins (1bd7A = 1.4 Å/84 res. 36% sequence ID; 1gcs = 1.6 Å/84 res. 36% sequence ID) was easily eliminated by REPRO but the more distant internal (Greek-key) repeat within each individual domain did not score enough to be down-graded and the structure 1dsl holds its position after filtering.

As would be expected, there are no sequence repeats detected by REPRO that do not have a corresponding structural repeat. Although this simply reflects the principle that structure is better conserved than sequence, it is interesting to examine the proteins that approach the violation of this principle most closely. These are both artificial constructs, being linked dimers of a globin (1abwA) and the HIV protease (1hvp). Their exact internal sequence repeat gives a very high REPRO score while the single structural duplication, although strong, constitutes only one off-diagonal ridge in the score matrix. The closest native proteins to these are the annexins (1aei and 1axn).


    Discussion
 
Summary of the results

Fourier calculation. The SAP score matrix was found to be a useful construct in which to search for repeated structures using a Fourier analysis. For the Fourier calculation, trials indicated that it was best to avoid artificial constructs, such as the ‘padding’ required for the FFT and the calculation in the direction of the antidiagonal. When tested with model data, the simple approach used above provided a clear signal of the repeat periodicity that could be used to extract the repeating units with the SAP program.

Model structures. The strength of the Fourier signal would have been difficult to interpret without the use of good model proteins for calibration. The most useful of these was the novel random-walk approach employed to generate realistic ‘fake’ structures. On the basis of these it was possible to conclude that only a small proportion of protein structures have an unexpected degree of symmetry.

The artificially generated ‘ideal’ folds provided an upper limit on the strength of signal that could be expected from a ‘perfectly’ repeating compact structure. Unexpectedly, some of the very regular ß-propellor folds attained the same strength of signal but the majority of symmetric structures lay below this region.

Proteins ranked by symmetry. When native proteins were ranked by the power of their spectrum (normalized for length) the ß-propellor folds occupied the top positions but, otherwise, a wide variety of fold types were seen to score highly. In the ß{alpha} class, these included the globular ß{alpha} proteins (TIM barrels, Rossmann-like folds) and the more repetitive leucine-rich ß{alpha} folds. In the all-ß class, ß-propellors, ß-prisms and ß-helices were found as well as the more globular {gamma}-crystallin domains, with both single- and double-domain structures. The less-regular all-{alpha} class was represented by the double {alpha}-barrel structure 1dceB.

When this ranked list was corrected to down-weight proteins that contained detectable internal sequence similarity (using the program REPRO), the list became almost exclusively composed of globular ß{alpha} class proteins. In the top 50 re-ranked proteins, only a single 4-fold propellor structure remained, and it lay well below its former ranking.

Assessment of the Fourier approach

Advantages and disadvantages. The Fourier transform is a simple way of extracting the periodicities in a signal. As applied above, it uses all the information in the 2D score matrix and allows statistics to be gathered on the significance of the peaks in the spectrum. Even without using the FFT algorithm, the calculation takes very little time relative to the calculation of the structure comparison.
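The calculation described in the preceding paragraph can be illustrated with a minimal sketch. It assumes that each row of the score matrix is transformed by direct summation (no FFT and no padding) and that the per-row powers are then averaged, with the spread across rows giving an error estimate. The function names, the normalization by row length and the toy cosine-ridge matrix are illustrative assumptions and are not taken from the SAP code.

```python
import cmath
import math


def row_power(row, max_freq):
    """Power at frequencies 1..max_freq by direct summation (no FFT, no padding)."""
    n = len(row)
    mean = sum(row) / n  # remove the constant (zero-frequency) component
    powers = []
    for k in range(1, max_freq + 1):
        coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(row))
        powers.append(abs(coeff) ** 2 / n)  # normalization by n is an assumption
    return powers


def matrix_spectrum(matrix, max_freq):
    """Mean and standard deviation (across rows) of the power at each frequency."""
    spectra = [row_power(row, max_freq) for row in matrix]
    means, sds = [], []
    for k in range(max_freq):
        column = [s[k] for s in spectra]
        m = sum(column) / len(column)
        var = sum((p - m) ** 2 for p in column) / len(column)
        means.append(m)
        sds.append(math.sqrt(var))
    return means, sds


# Toy "smoothed score matrix": 48 positions with ridges every 12 positions,
# i.e. a hypothetical perfect 4-fold repeat.
n, period = 48, 12
toy = [[1.0 + math.cos(2 * math.pi * (i - j) / period) for j in range(n)]
       for i in range(n)]

means, sds = matrix_spectrum(toy, 12)
best = 1 + max(range(len(means)), key=means.__getitem__)
print("strongest frequency:", best)
```

For this idealized matrix the strongest frequency is 4, matching the four repeats, and the spread across rows is negligible because every row carries the same periodicity. With a real score matrix the rows differ, and the standard deviation gives a rough measure of how reliable each peak is.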

As mentioned in the Introduction, a disadvantage of the approach is that it is not possible to tell the number or (within limits) the size of the sub-structures that give rise to any periodicity without returning to examine the original score matrix. Given that some information must be lost in the reduction of a complex structure to a few numbers, this is perhaps not unexpected and, as was outlined in Methods, the transformed signal can still be used to help extract the repeating substructures.

A further ramification of this loss of information arises when the sub-structures occur in a range of sizes, as is often seen with the ß{alpha}-units in the 8-fold ß{alpha}-barrel folds. In this situation, rather than the expected sharp peak at frequency N/8, the peak becomes spread (or sometimes split) over adjacent frequencies. Similarly, if the symmetric domain has a large insert or terminal addition, then the peak will again be displaced from the expected frequency. An example of this can be seen at the top of Table Ia where the number of repeats for the 7-fold ß-propellor structure 1gotB is recorded as eight. Figure 10 shows that the obvious explanation for this is the long N-terminal {alpha}-helix and loop.
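The displacement of the peak by a terminal addition can be reproduced with a small synthetic example. The sketch below assumes an idealized one-dimensional signal in which seven identical 'blades' follow a flat N-terminal segment of about one blade length. The blade length of 20 and the cosine profile are hypothetical choices made only for illustration and are not measurements from 1gotB.

```python
import cmath
import math


def peak_frequency(signal, max_freq):
    """Frequency (cycles per full length) with the largest power, by direct summation."""
    n = len(signal)
    mean = sum(signal) / n

    def power(k):
        coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        return abs(coeff) ** 2

    return max(range(1, max_freq + 1), key=power)


blade_len = 20  # hypothetical blade length
# Seven identical blades, each one full cosine cycle long.
blades = [math.cos(2 * math.pi * i / blade_len) for i in range(7 * blade_len)]
# The same blades preceded by a featureless segment of one blade length.
with_addition = [0.0] * blade_len + blades

print(peak_frequency(with_addition, 12), peak_frequency(blades, 12))
```

Under these assumptions the chain with the terminal addition peaks at frequency 8, because each blade is one-eighth of the total length, while the isolated seven-blade domain peaks at 7.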



Fig. 10. Repeats in a 7-fold ß-propellor fold. The protein 1gotB was identified as a very repetitive structure but with an 8-fold repeat rather than the expected 7-fold corresponding to its propellor structure (Table Ia). However, counting the N-terminal helix and loop (drawn thinner), each propellor ‘blade’ is roughly one-eighth the length of the protein. This illustrates that a peak in the Fourier spectrum does not necessarily correspond to the number of repeated sub-structures. The repeats identified by the visualization method described in the text (Methods) are each coloured differently starting with red for the smallest, through green, to blue for the largest.

 
Improvements and further applications. To provide an unbiased analysis of native proteins, the Fourier transform approach was applied ‘blindly’ to the structures discussed in this work. However, it is clear from the preceding discussion (and from examples in Results) that some pre-filtering of the structures into domains would lead to more periodicities of the expected frequencies. Indeed, the reverse might also be true: that a Fourier analysis could help extract domains with integral multiples of repeats or help to split proteins into repeats (such as large 2-fold repeat units) where this is more desirable.

Repeats in the sequence were identified using the program REPRO but they can also be identified using the same Fourier transform approach as applied here to structures. This was tested by presenting the initial high-level SAP matrix directly to the Fourier transform, without evaluation of any pairs at the low level (Figure 1). Strong signals were obtained but these will need further analysis to find if it is best to use just sequence identity, a Dayhoff-like substitution model or a combination of this and local structural properties as used to initialize the SAP calculation.

An approach to avoid the problems of variable length in the repeating units might be to count the number of peaks in the raw signal (rather than convolving them with a periodic function as in the Fourier transform). This is often a difficult process to automate but some recent applications to other data might provide a way forward (May, 2001; Taylor, 2001).
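As a rough illustration of peak counting, the sketch below smooths a raw one-dimensional signal with a moving average and counts the local maxima above a threshold. It is not the method of the works cited above. The window size, the threshold and the synthetic signal (eight bumps of unequal length) are all assumptions chosen for the example.

```python
import math


def smooth(signal, half_window):
    """Moving average; the window is truncated at the ends of the signal."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def count_peaks(signal, half_window=2, threshold=0.5):
    """Count local maxima of the smoothed signal that rise above the threshold."""
    s = smooth(signal, half_window)
    peaks = 0
    for i in range(1, len(s) - 1):
        # Strict rise on the left, non-strict fall on the right, so a
        # two-point flat top is counted once.
        if s[i] > threshold and s[i] > s[i - 1] and s[i] >= s[i + 1]:
            peaks += 1
    return peaks


# Eight repeating units of unequal (hypothetical) length, one bump per unit.
unit_lengths = [10, 14, 12, 16, 11, 15, 13, 9]
signal = []
for length in unit_lengths:
    signal.extend(math.sin(math.pi * (i + 0.5) / length) ** 2
                  for i in range(length))

print(count_peaks(signal))
```

Because each unit contributes one maximum whatever its length, the count is insensitive to the size variation that spreads or splits a Fourier peak. On real score data the choice of smoothing and threshold would need calibration.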


    Origin of structural symmetry
 
One of the main conclusions of the analysis here is that the majority of proteins have no more symmetry in their structures than would be expected from a compact random-walk. Of those that remain, much can be explained as a result of clear sequence duplication which in general will produce helical structures (e.g. ß-helices and ß-prisms) or in the special case, a closed circle (ß-propellor). After removing these (on the basis of sequence) it was an unexpected result to find that almost all those remaining were globular ß{alpha}-class proteins.

This result was unexpected as there is no obvious structural reason why more globular ß{alpha} proteins should not be found with a clear internal sequence duplication or why more symmetric all-ß or all-{alpha} proteins should not exist without an accompanying signal in the sequence. One simple explanation might be that the ß{alpha} class is so numerous that examples in the other classes have just not yet been observed. Alternatively, it might be argued that the more obviously repeating structures are more recently evolved and so retain their sequence signal while those in the ß{alpha} class tend to be ancient metabolic enzymes.

A more structural explanation of the dominance of the globular ß{alpha} proteins might be based on the relative sizes and degrees of structural freedom that are available to the different super-secondary structure types. Those composed of all-ß structure have a geometric regularity imposed by the plane of the ß-sheet but are otherwise relatively topologically unconstrained, so giving rise to few symmetries by chance. The all-{alpha} structures lack the spatial register imposed by a hydrogen-bonded sheet and so will naturally be less symmetric in their packing. However, as the {alpha}-helix is a relatively large structure, smaller proteins (with fewer than six helices) will stand a good chance of acquiring a symmetric arrangement. The ß{alpha} unit combines both symmetry-inducing attributes of the previous types, having the spatial register of the ß-sheet while being relatively large, so that there will not be too many unsymmetric arrangements in a protein of typical size.

The unexpectedly high values for the smaller ß{alpha} model protein (Figures 3e and 8a) lend some support to this explanation. However, a more extensive analysis over a wider range of model structures will be necessary before any firm theoretical conclusions can be drawn.


    Conclusions
 
The Fourier analysis of protein structure provides a fast and automatic method for the detection of symmetry. It will prove to be most effective when integrated more fully with other methods including domain identification, the analysis of associated sequence repeats and multiple structure superposition.

As increasing numbers of protein structures are being determined in a high-throughput manner, the automatic analysis of symmetry will provide an additional tool to help annotate the emerging structures and aid in their comparison and classification.



Fig. 7. Power spectrum from eight ß{alpha} repeats. The test data from the last eight repeats in the protein 2bnh (Figures 4 and 6d) gave a clear 8-fold signal in the spectrum. Error bars are plotted at ±{sigma} (1 SD) as calculated from the variation seen across the rows in the score matrix. (The figure was produced by the program GNUPLOT.)

 

    Notes
 
3 Present address: IncyteGenomics Inc., Palo Alto, CA, USA

4 Present address: Accelrys Ltd, 230/250 The Quorum, Barnwell Road, Cambridge CB5 8RE, UK

1 To whom correspondence should be addressed. E-mail: wtaylor{at}nimr.mrc.ac.uk

1 In the following sections, the term ‘repeat’ will be largely used to refer to sequence repeats, while the term ‘symmetry’ will be used mainly for structural repeats. This distinction is made to emphasize that the measure of structural similarity used here is based on the comparison of 3D structures and captures more of the structural context than just the linear arrangement of secondary structures. For example, if the comparison matches two ß{alpha}ßß sub-structures, then these will have the same internal structure and will also hold a similar 3D relationship to the rest of the structure. This implies symmetry but does not specify which sort.

2 The original Rossmann fold was half a dinucleotide-binding domain: (ß{alpha})3. It is used here to refer to the double fold, 2 × (ß{alpha})3, which constitutes an intact domain.


    Acknowledgments
 
T.P.F. was supported by the UK (MRC) HGMP. Sam Karlin is thanked for facilitating collaboration between W.R.T. and F.B. Janet Thornton and David Jones are thanked for useful discussion, as are present and past members of the Division of Mathematical Biology at NIMR, especially András Aszódi. Availability: The program codes and data used in this work can be obtained by FTP from ftp://mathbio.nimr.mrc.ac.uk.


    References
 
Banner,D.W., Bloomer,A.C., Petsko,G.A., Phillips,D.C., Pogson,C.I. and Wilson,I.A. (1975) Nature, 255, 609–614.

Blundell,T.L. and Srinivasan,N. (1996) Proc. Natl Acad. Sci. USA, 93, 14243–14248.

Cohen,F.E. and Sternberg,M.J.E. (1980) J. Mol. Biol., 138, 321–333.

Cohen,F.E., Sternberg,M.J.E. and Taylor,W.R. (1982) J. Mol. Biol., 156, 821–862.

Duncan,B.S. and Olson,A.J. (1993) Biopolymers, 33, 123–456.

Flores,T.P., Orengo,C.A., Moss,D.S. and Thornton,J.M. (1993) Protein Sci., 2, 1811–1826.

Heringa,J. and Argos,P. (1993) Protein Struct. Funct. Genet., 17, 391–411.

Heringa,J. and Taylor,W.R. (1997) Curr. Opin. Struct. Biol., 7, 416–421.

Jonassen,I., Eidhammer,I., Grindhaug,S.H. and Taylor,W.R. (2000) J. Mol. Biol., 304, 599–619.

Matthews,B.W. and Rossmann,M.G. (1985) Methods Enzymol., 115, 397–420.

May,A.C.W. (2001) Protein Eng., 14, 209–217.

McLachlan,A.D. (1977) Biopolymers, 16, 1271–1297.

McLachlan,A.D. (1979) J. Mol. Biol., 128, 49–79.

McLachlan,A.D. (1983) J. Mol. Biol., 169, 15–30.

McLachlan,A.D. and Stewart,M. (1977) J. Mol. Biol., 103, 271–298.

Murzin,A.G. (1992) Protein Struct. Funct. Genet., 14, 191–201.

Press,W.H., Flannery,B.P., Teukolsky,S.A. and Vetterling,W.T. (1986) Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge, UK.

Rao,S.T. and Rossmann,M.G. (1973) J. Mol. Biol., 76, 241–256.

Salem,G.M., Hutchinson,E.G. and Orengo,C.A. (1999) J. Mol. Biol., 287, 969–981.

Taylor,W.R. (1991) Protein Eng., 4, 853–870.

Taylor,W.R. (1999) Protein Sci., 8, 654–665.

Taylor,W.R. (2001) J. Mol. Biol., 310, 1135–1150.

Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 1–22.

Thornton,J. and Sibanda,B. (1983) J. Mol. Biol., 167, 443–460.

Received August 28, 2001; revised November 2, 2001; accepted November 21, 2001.




