Protein structure alignment using environmental profiles

Jongsun Jung1 and Byungkook Lee2

Laboratory of Molecular Biology, Division of Basic Sciences, National Cancer Institute, National Institutes of Health, 37 Convent Drive, MSC 4255, Bethesda, MD 20892-4255, USA


    Abstract
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
A new protein structure alignment procedure is described. An initial alignment is made by comparing a one-dimensional list of primary, secondary and tertiary structural features (profiles) of two proteins, without explicitly considering the three-dimensional geometry of the structures. The alignment is then iteratively refined in the second step, in which new alignments are found by three-dimensional superposition of the structures based on the current alignment. This new procedure is fast enough to do all-against-all structural comparisons routinely. The procedure sometimes finds an alignment that suggests an evolutionary relationship and which is not normally obtained if only geometry is considered. All pair-wise comparisons were made among 3539 protein structural domains that represent all known protein structures. The resulting 3539 z-scores were used to cluster the proteins. The number of main clusters increased continuously as the z-cutoff was raised, but the number of multiple-member clusters showed a maximum at z-cutoff values of 5.0 and 5.5. When a z-cutoff value of 5.0 was used, the total number of main clusters was 2043, of which only 336 clusters had more than one member.

Keywords: comparison/clustering/protein structure alignment/SHEBA


    Introduction
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Protein structures are being determined at a very rapid rate; as of February 1999, there were more than 8600 proteins in the Protein Data Bank and the number is increasing daily. In order to understand and make effective use of this vast database, the protein structures must be compared and organized in some rational fashion. It is useful to have a fast, automatic structure comparison procedure for this purpose. Protein structure comparisons have been made since the very early days of protein crystallography. These pioneering early works have been reviewed (Matthews and Rossmann, 1985Go). Many of the new methods, including the method that will be presented in this paper, can be considered as refinements of these early works. However, these early methods are too slow to handle the volume of data that is now available.

Fast methods that employ many different strategies have now been reported. In the fragment assembly methods (Fischer et al., 1992Go; Holm and Sander, 1993Go; Boutonnet et al., 1995Go; ZuKang and Sippl, 1996Go; Shindyalov and Bourne, 1998Go), many or all small fragments from two structures are compared and those that are sufficiently similar are selected and assembled into a larger consistently similar set. Many other methods use a similar strategy but use only the secondary structural elements (Artymiuk et al., 1990Go; Rufino and Blundell, 1994Go; Madej et al., 1995Go; Mizuguchi and Go, 1995Go; Alesker et al., 1996Go; Alexandrov and Fischer, 1996Go; Koch et al., 1996Go). In another set of methods, an alignment for the whole structure is obtained and refined, typically using a dynamic programming algorithm (Satow et al., 1986Go; Taylor and Orengo, 1989Go; Zuker and Somorjai, 1989Go; Orengo et al., 1993Go; Holm and Sander, 1995Go; Suyama et al., 1997Go; Gerstein and Levitt, 1998Go). A problem in comparing structures that are only remotely similar is that there is no natural, objective way of deciding what is to be considered similar (Mizuguchi and Go, 1995Go). Different methods can compare the same two protein structures and conclude differently on their similarity. Even when two structures are judged to be similar, different structural alignments can be produced that appear equally valid (ZuKang and Sippl, 1996Go). Therefore, it is desirable to have many different methods available that focus on different features of the structure and use different sets of parameters for similarity comparisons (Madej et al., 1995Go).

All of the techniques cited above, except that by Suyama et al. (1997), use geometry alone for the comparison, ignoring sequence homology and similarity of the structural environment of the residues. On the other hand, Suyama et al. ignored the three-dimensional (3D) geometry altogether and compared structures on the basis of (position-sensitive) 3D profiles alone (Bowie et al., 1991Go), which included residue type-dependent solvent accessibility, hydrogen bonds, local secondary structural states and side-chain packing. The 3D profile matching is a one-dimensional (1D) process, much like the sequence–sequence alignment and they use the dynamic programming method (Needleman and Wunsch, 1970Go) for a fast alignment. The method produced results that are comparable to those from other methods, but with differences characteristic of the method. For example, two-domain structures, with rather different relative orientation of the two domains, could be aligned by the procedure. On the other hand, alignments are often inaccurate; they can be shifted by a few residues or sometimes by an entire secondary structural element (Suyama et al. 1997Go).

In the new method reported here, an initial alignment is generated using a profile method like that of Suyama et al. (1997), although we use generic rather than position-sensitive profiles. But this initial 1D alignment is then refined in a refinement cycle that obtains new alignments based on the 3D superposition of currently aligned residues. The profile used in the initial alignment includes sequence homology, similarity of the secondary structural states and the tertiary environment of each residue and is normalized differently from that used by Suyama et al. (1997). The 3D refinement is made using the dynamic programming method first used by Satow et al. (1986) and by many others since (Zuker and Somorjai, 1989Go; Rufino and Blundell, 1994Go; Holm and Sander, 1995Go; Alexandrov and Fischer, 1996Go; ZuKang and Sippl, 1996Go).

Because the method produces the final alignment on the basis of 3D comparisons, it should normally generate alignments that are similar to those produced by other purely geometry-based methods. On the other hand, if an alternative alignment exists that is chemically more sensible than a purely geometric alignment, this method may find the alternative alignment since it starts from an alignment based on the profiles. As will be described in this paper, we have found an instance of such a case in aligning a family of ferredoxins.

The method (SHEBA, for Structural Homology by Environment-Based Alignment) is fast. Its speed allows comparison of each of the 3539 domains, which represent all sequentially homologous groups in the March 1998 release of PDB, against all others. This is a sufficiently large number of structures that z-scores can be calculated for the alignment of a pair of proteins on the basis of aligning each structure of the pair against all others. We report the results of clustering all known protein structures on the basis of this measure of similarity.


    Materials and methods
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Aligned protein pair database

A set of 68 benchmark protein pairs (Fischer and Eisenberg, 1996Go) was used to obtain parameters for the alignment score function. The two proteins of each pair have <25% sequence homology between them, but are clearly homologous in terms of the three-dimensional structure. The reference structural alignment for each pair of this dataset was obtained in a series of preliminary studies (see below). This dataset, along with the reference alignments, is referred to as the aligned protein pair (APP) database.

Scoring function

The alignment score (a-score) for a pair of proteins was calculated as the sum of scores for each aligned pair of residues. The a-score for a pair of aligned residues, i and i', was computed by


(1)

The first three terms in this expression represent scores for the sequence homology, for the similarity of the secondary structures to which the residues belong and for the similarity of the solvent accessibility and polarity of the environment of the residues (Bowie et al., 1991Go). These are explained below. gp is the opening gap penalty and {delta}ii' = 1 or 0 depending on whether there is or is not, respectively, a gap between residues i and i + 1 or residues i' and i' + 1. A gap was allowed between two residues only when the distance between the C{alpha} atoms of the corresponding residues in the other protein of the aligned pair (gap distance) was between 3.8 and 8 Å. This range of values was chosen from the frequency of gap distances observed in the APP database (data not shown). A gap extension penalty was not used. The procedure used to determine the value of gp will be described later. The constant term c was set to unity; when summed over all aligned residue pairs, this term gives the number of aligned residue pairs in the alignment. Inclusion of this term tends to maximize the number of aligned residues.

The values for the Hii', Sii' and Eii' terms were obtained by looking up the appropriate entries in tables of values that were pre-calculated using the APP database. In order to set up these tables, two residue pair lists were made from the APP database. One consisted of all the aligned residue pairs in all the protein pairs in the database. The second list was much larger and consisted of a concatenation of all possible npxnp' residue pairs of each protein pair p–p', where np and np' are the number of residues in the two proteins p and p'. These are referred to as the aligned and `random' residue pair lists, respectively. For the dataset used, the total number of aligned and random residue pairs was 9202 and 3 867 377, respectively. All the table values were calculated in a similar manner as the logarithms of the ratio of a normalized frequency (PA) observed in the aligned residue pair list versus the corresponding normalized frequency (PR) observed in the random residue pair list. When PA was <0.001 for any frequency, the ratio was set to 1.

Hii' was set equal to Hii' = H(Ri, Ri'), where Ri and Ri' are the amino acid types of residues i and i', respectively and H(Ri, Ri') is a sequence homology matrix. We used the matrix of Gonnet et al. (1992) for H in some preliminary studies, but the final results reported here were obtained using a new matrix computed from the APP database by


(2)

where PA(R, R') and PR(R, R') were the normalized frequencies with which the amino acid types R and R' were found paired in the aligned and random residue pair lists, respectively.

Similarly, Sii' was set to Sii' = S(si, si), where si and si' are the secondary structural states of residues i and i', respectively, and S is the secondary structural homology matrix. The S matrix was computed as


(3)

where PA(s, t) and PR(s, t) were the normalized frequencies with which a residue in the secondary structural state s was found paired with a residue in the secondary structural state t in the aligned and random residue pair lists, respectively. The secondary structural states used were the eight states defined by Kabsch and Sander (1983) and were assigned to each residue of the known structures using their DSSP program.

The Eii' term was calculated by


(4)

where fai and fai' are the accessible surface area fractions (see below) of residues i and i', respectively, and fpi and fpi' are the polar fractions of the environment (see below) of the residues i and i', respectively (Bowie et al., 1991Go). The accessible surface area fraction, fa, of a residue is the accessible surface area (Lee and Richards, 1971Go) of the residue divided by the maximum accessible surface area for the amino acid type of the residue. This latter area was the maximum observed for the same type of residue in the middle of some 110 000 tripeptides. This tripeptide database was generated from the protein structures selected from the November 1996 release of the EMBL list of representative proteins available through web site http://ftp.embl-heidelberg.de/pub/databases/pdb_select (Hobohm and Sander, 1994Go). The algorithm used for the accessible surface area calculation was that by Shrake and Rupley (1973). The area of a residue was calculated as the sum of the areas for the C{alpha} and the side-chain atoms.

The polar fraction of the environment, fp, is the fraction of the accessible surface area that is either buried by the polar atoms (N, O) or exposed to the solvent. When a surface patch was buried by both polar and non-polar atoms, it was counted as polar.

The function E was given by


(5)

where PA({Delta}fa) and PR({Delta}fa) were the normalized frequencies of residue pairs with an fa difference {Delta}fa in the aligned and random residue pair lists, respectively, and PA({Delta}fp) and PR({Delta}fp) were defined similarly using the fp difference {Delta}fp.

Wh, ws and we in Equation 1Go are the relative weights for the corresponding terms. The optimum values for these and for gp were determined by minimizing the alignment shifts. The alignment shift of an alignment (trial alignment) with respect to another (reference alignment) is defined by


(6)

where i and i' are a pair of residues which are aligned in the reference alignment, but which are separated by {Delta}rii' number of residues and gaps in the trial alignment, N is the total number of aligned residue pairs in the reference alignment and the summation is over the N residue pairs. Alignment shifts were calculated after the initial alignment (see the alignment procedure in the next section) for the 68 protein pairs of the APP database. By systematically searching the entire range of possible values in discrete steps, it was found that the overall alignment shift was the smallest when wh, ws, we and gp were set to 0.5, 1.0, 1.5 and 5.0, respectively. Thus, the optimum weight is the largest for the accessibility and polar environment term (we) and smallest for the sequence homology term (wh). Small wh was expected since these pairs of proteins were selected to have low sequence homology between them (Fischer and Eisenberg, 1996Go). On the other hand, the optimum wh value is definitely non-zero since setting it to zero gives clearly worse results (data not shown). This indicates that use of the sequence homology is helpful in aligning structures even when they appear to have little sequence homology.

Alignment procedure

The optimum alignment between two protein structures was obtained by an iterative procedure which involved the following three steps: in the first step, the initial alignment was obtained by maximizing the a-score using a fast dynamic programming procedure based on the Needleman–Wunsch algorithm (Needleman and Wunsch, 1970Go). In the second step, Kabsch's equations (Kabsch, 1976Go, 1978Go) were used to translate and rotate one structure against the other so as to minimize a weighted mean square distance between the C{alpha} atoms of the aligned residues. The weighting scheme used is described in the next paragraph. In the third step, a new alignment was obtained from this pair of superimposed structures by using the Needleman–Wunsch algorithm that maximized the number of residue pairs whose C{alpha} distances were <3.5 Å. The gap penalty was found to be unnecessary for this procedure. Steps two and three were repeated. The structure superposition/realignment cycle converged rapidly: for the 68 test proteins, the average number of aligned residues after just three cycles was within 0.2 residue of that after nine cycles (data not shown). The results reported in this paper are those obtained using three cycles of refinement.

It should be noted that the initial step involves only one-dimensional (1D) alignment and ignores three-dimensional (3D) relationships among different structural elements. For some protein pairs, the initial 1D alignment produces an alignment that is incompatible in 3D and which the subsequent 3D–1D superposition cycles could not resolve. Therefore, we used five different weighting schemes in the Kabsch superposition step: (1) all residues weighted equally, (2) the N-terminal and C-terminal halves weighted in a 1.0:0.1 ratio, (3) the N- and C-terminal halves weighted 0.1:1.0, (4) the N-terminal quarter, middle half and C-terminal quarter weighted 0.1:1.0:0.1 and (5) the N-terminal quarter, middle half and C-terminal quarter weighted 1.0:0.1:1.0. The final alignment chosen was the best among all five alignments generated by these different Kabsch-step weighting schemes.

The degree of structural similarity between two optimally aligned proteins was measured by means of the number of aligned residue pairs (m-value). A pair of residues was considered aligned if their C{alpha} atoms were <3.5 Å apart. The m-score was defined as the m-value divided by the size (number of amino acids) of the larger protein.

Reference alignments for the 68 protein pairs

The reference alignments were generated using the alignment procedure described above, but using a scoring function that did not require knowledge of the reference alignment. Specifically, the matrix by Gonnet et al. (1992) was used for H(a, b), S(s, t) was set to 1 if s and t were equal and 0 if not and E({Delta}fa, {Delta}fp) was set equal to the negative of the Euclidean distance in the two-dimensional space of {Delta}fa and {Delta}fp. For some protein pairs, this procedure failed and the initial alignment had to be found manually using the in-house graphics program GEMM (Syi and Lee, 1988Go). During the refinement of the initial alignment, the three weights and the gap penalty were varied specifically for each protein pair. After alignments were found for all the proteins, the S, H and E scoring matrices were determined and new alignments were obtained using the new scoring function.


    Results
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Tests with the 68 aligned pairs of proteins

The new structure alignment procedure was tested using the 68 protein pairs (Fischer and Eisenberg, 1996Go) that were used to obtain the score function and the weight and gap penalty parameters. The computer time taken to align a protein pair was <1 s for most of the proteins (Figure 1Go). In the reference alignment, the number of aligned residues (m-value) ranged from 26 to 91% of the number of residues of the larger of the two proteins. When the same pairs were aligned using SHEBA, 63 pairs were aligned with m-values within 5% of the reference alignment. The m-values for six other pairs deviated 5–11% from the reference alignment, but visual inspection clearly indicated that the two alignments were essentially the same for all six pairs. For the remaining two pairs (1aep–256bA and 5fd1–2fxb), SHEBA gave significantly lower match scores than the reference alignment.



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 1. Computer CPU time taken to align each of the 68 pairs of proteins using an SGI Origin R10000 processor.

 
The proteins 256bA and 1aep have up-and-down four- and five-helix bundle structures, respectively. In the reference alignment, helices 2, 3 and 4 of 256bA are aligned to helices 2, 3 and 4 of 1aep, respectively. This alignment puts helix 1 of 256bA nearly in the same physical space that helix 5 of 1aep occupies. In contrast, SHEBA gives the alternative alignment in which helices 2, 3 and 4 of 256b are aligned to helices 1, 4 and 5 of 1aep. The two proteins have a low sequence homology [14% according to Fischer and Eisenberg (1996) and 9% in the reference alignment] and unrelated functions (1aep is a lipoprotein whereas 256b is a cytochrome). The structural homology detected here by either alignment is likely to be accidental owing to the simple up-and-down helix bundle architecture of the structures.

The other pair (5fdl–2fxb) are ferredoxins, for which SHEBA produces another alternative alignment as shown in Figure 2Go. The 5fd1 protein has two reaction centers. One (Fe3S4) contains three irons and three cysteine residues, as well as four sulfur atoms and the other (Fe4S4) four irons, four cysteine residues and four sulfur atoms, shown respectively at the bottom and top of the 5fd1 structure in Figure 2Go. The 2fxb protein has only one reaction center of the Fe4S4 type. In the reference alignment, all four ß-strands and two helices of 2fxb are aligned to the first four ß-strands and two helices of 5fd1 sequentially, resulting in 51 matched residues. However, this aligns the Fe4S4 center of 2fxb with the Fe3S4 center of 5fd1 and matches only three cysteine residues (residues 11, 14 and 61 of 2fxb aligned to the residues 8, 16 and 49 of 5fd1, respectively). In the SHEBA alignment, the Fe4S4 center of 2fxb is aligned to the Fe4S4 center of 5fd1. Again, only three cysteines are aligned (residues 11, 14 and 17 of 2fxb aligned to the residues 39, 42, 45 of 5fd1, respectively), but the spacings between the aligned residues are the same in the two structures and the geometry of the loop that spans these residues is more similar. This alignment matches the N-terminus of 2fxb to the middle of 5fd1 (Figure 2Go) and aligns only the first two ß-strands and one helix, producing 34 matched residues. However, the C-terminal half of 2fxb becomes aligned to the N-terminal half of 5fd1 in the three-dimensional structure (Figure 2Go). Indeed, if 2fxb molecule is circularly permutated by joining the C- and N-termini and by cutting the protein at a point between the second and third ß-strands (Figure 3Go), the number of matched residues increases to 54. This alignment also aligns C61 of 2fxb to C20 of 5fd1, thus matching all four cysteines of the Fe4S4 reaction center of both proteins. We found a number of other ferredoxins that also align to 5fd1 after a similar circular permutation (Table IGo). The protein 1blu is particularly notable. It is of similar length to 2fxb and matches 53 residues after the circular permutation, of which 23 (43%) are identical (Figure 4Go).



View larger version (28K):
[in this window]
[in a new window]
 
Fig. 2. Straight and circularly permutated alignments of two ferredoxins. The Fe–S clusters are shown in yellow and the matched parts of the polypeptide chain in matching blue and green. The two orientations of the structure for 2fxb shown in the left and right flanks are related by a 180° rotation around the viewing axis. The same N-terminal half of 2fxb is colored green for the left molecule and blue for the right molecule. The left 2fxb aligns to 5fd1 straightforwardly (reference alignment) whereas the right 2fxb aligns to 5fd1 in a circularly permutated manner (SHEBA alignment).

 


View larger version (8K):
[in this window]
[in a new window]
 
Fig. 3. Secondary structures along the sequence of 5fd1 (a) and 2fxb (b). Black arrows represent the ß-strands and open boxes the {alpha}-helices. The triangular and square symbols represent the Fe3S4 and Fe4S4 clusters respectively. In the circularly permuted version of 2fxb, the N- and C- termini are joined and new termini are made by cutting the protein at the point indicated by the dotted line.

 

View this table:
[in this window]
[in a new window]
 
Table I. Alignments of some members of the ferredoxin family to 5fd1
 


View larger version (12K):
[in this window]
[in a new window]
 
Fig. 4. Aligned sequences of 5fd1 and a circularly permutated 1blu. The arrow indicates the position of the C- and N-termini of the original 1blu. Aligned residues are indicated by the stars and vertical lines (for cysteines). Identities are boxed in gray. The residues that form {alpha}-helices and ß-strands are indicated by the corresponding Greek letters above and below the sequences. The 5fd1 sequence is truncated after the Glu (57) in the alignment.

 
Comparison with other structure alignment procedures

The performance of the new procedure was further tested by comparing the alignments it produced with those published using four other structure alignment procedures (Table IIGo). The alignments using SHEBA were essentially the same as those found by Dali (Holm and Sander, 1993Go), the MCL method (Boutonnet et al., 1995Go), VAST (Gibrat et al., 1996Go) and one of the two alternative alignments by ProSup (ZuKang and Sippl, 1996Go) in all but three or four cases compared. The exceptional cases are the 2tmvP–256bA, 3chy–1rcf, 1aba–1pbf and 1acx–1tnfA pairs, which are described below.


View this table:
[in this window]
[in a new window]
 
Table II. Comparison of alignments produced by SHEBA and other programs
 
The proteins 2tmvP and 256bA have a four-helix bundle structure of identical topology and ProSup and SHEBA give alignments which match the same helix pairs from the two proteins. However, the whole helix bundle of 256bA is shifted by seven residues in one ProSup alignment and by four residues in the other, compared with that in the SHEBA alignment. The residue numbers of some of the aligned pairs of 2tmvP–256bA are 123–96, 124–97 and 125–98 in one and 123–93, 124–94 and 125–95 in the other ProSup alignments and 123–89, 124–90 and 125–91 in the SHEBA alignment. All three alignments appear to be about equally valid, as judged visually and by the number of matched residues, which are 64, 62 and 68, respectively, for the above three alignments.

The alignment obtained for the 3chy–1rcf pair is also an alternative alignment, but of different type. Both structures have an {alpha}/ß sandwich fold, in which a five-stranded ß-sheet is surrounded by helices 1 and 5 on one side and by helices 2, 3 and 4 on the other side. The protein 1rcf has, in addition, an extended loop after helix 4. The ProSup alignment matches helices 1–1, 2–2, 3–3, 4–4 and 5–5 in one alignment and 1–1, 2–2, 3–3 and 5–5 in the other. In the SHEBA alignment, helices match as 1–1, 2–2, 3–4, 5–5 and the helix 4 in 3chy occupies the space occupied by the extended loop of 1rcf. Again, it is difficult to tell which of the three alignments is better by visual inspection. The numbers of matched residues are 75, 74 and 82 for the above three alignments, respectively.

For the 1aba–1pbf pair, VAST and SHEBA give alignments that appear visually similar, but they were not compared quantitatively because the VAST alignment was not readily available for this pair. SHEBA failed to recognize the 1acx–1tnfA pair as being structurally homologous. The failure is understandable because this pair is structurally homologous only when the topological difference is ignored (Holm and Sander, 1993Go) whereas SHEBA is inherently sensitive to chain connectivity since it uses the dynamic programming algorithm.

All-against-all comparison and clustering

All-against-all protein structure comparisons were made using domains produced by the domain parsing program PUU (Holm and Sander, 1994Go). The 18 595 protein domains, representing all protein chains of 40 residues or more in the March 1, 1998 release of the Protein Data Bank (PDB), were grouped into 3539 high sequence homology groups (those with at least 50% sequence identity with the representative member of the group). The sequence comparisons made in this process used a fast Needleman–Wunsch algorithm (Needleman and Wunsch, 1970Go). The smallest protein was selected as the representative for each group and SHEBA was run for all pairs between these sequence representatives. The degree of structural homology between any two proteins was measured by the m-score, which is defined as the number of structurally aligned residues divided by the number of residues in the larger protein of the two. For clustering, z-scores were used as the distance measure between any two proteins. The z-score zab between protein a and protein b is the m-score between the two, measured in units of the root mean square deviation from the mean of the 3539 m-scores between protein a and all other proteins. We used z-scores rather than m-scores because the random noise background distribution was more uniform among different proteins when z-scores were used than when m-scores were used.

Clustering was done in two stages: main clusters were formed first and then subdivided into sub-clusters. A main cluster is defined as the complete collection of all sequence representatives, all of which are related to all others by a chain of neighbor relations; two proteins from two different main clusters are not related to each other by any chain of neighbor relations. The core of a sub-cluster is defined as the group of proteins, all of which are direct neighbors of each other. A sub-cluster was formed from a core by including the orphans (those which do not belong to any core) that shared more neighbor relations to the core than to any other core.

The number of main clusters and the size of the largest clusters are plotted as a function of the zcutoff value in Figure 5Go. As expected, all proteins are grouped into a few large clusters when a low zcutoff value is used, whereas almost all proteins form separate clusters when a high zcutoff value is used. In the medium range of zcutoff values, the number of clusters increases continuously with the zcutoff value. However, if single-member clusters are excluded, the number of clusters remains nearly constant in the range of zcutoff values of 4.5–6. We chose the zcutoff value of 5 mainly because the size of the largest cluster changes quickly in the zcutoff value range 3–5 and slowly beyond 5. Also, the number of multi-member clusters is nearly at the maximum and rather insensitive to the zcutoff value when the zcutoff value is 5. With this zcutoff value, the number of main clusters is 2043, of which 336 are multi-member clusters.



View larger version (21K):
[in this window]
[in a new window]
 
Fig. 5. The size of the largest cluster (diamonds) and the number of main clusters (squares), single-member clusters (circles) and multi-member clusters (triangles) as a function of zcutoff.

 
Neighbor relations among the 3539 sequence representatives are shown in Figure 6Go after the proteins had been sorted according to their secondary structure content and clustered. The structures of some of the proteins are also shown in Figure 6Go in Molscript representation (Kraulis, 1991Go). The size and location in the sorted list of the 10 largest clusters are given in Table IIIGo along with the names of typical proteins that belong to the cluster. The size of a cluster given here refers to the number of different sequence representatives. The actual number of proteins that have the similar fold is considerably larger since each representative represents many proteins with high sequence homology. We examine two sample clusters in more detail below.



View larger version (49K):
[in this window]
[in a new window]
 
Fig. 6. Neighbor relations among the 3539 sequence representatives after they are sorted and clustered. The proteins were first sorted in descending order of ß–{alpha}, where ß and {alpha} are the fractions of residues in the ß-sheet and {alpha}-helices, respectively. They were then clustered into the main and sub-clusters as described in the Methods section. The x- and y-axes of the graph are the serial numbers of the proteins in this sorted list. Each dot indicates that the corresponding pair of proteins are neighbors of each other. The zcutoff value used for the neighbor criterion was 5.0. Typical structures from a few large clusters are shown in Molscript drawings.

 

View this table:
[in this window]
[in a new window]
 
Table III. Ten largest main clusters
 
The largest cluster contains the immunoglobulin fold. A magnified view of this cluster is shown in Figure 7Go. This cluster includes the viral coat proteins (sub-clusters a, b and c), lectins (d), immunoglobulin-like molecules (e–h), azurins (i) and plastocyanins (j). These are all ß-sheet proteins of the Greek key topology (Richardson, 1981Go). The core members of the sub-clusters (e–h) are listed in Table IVGo. Each sub-cluster has a unique structural feature that is shared by all members of the sub-cluster. These features are noted in Table IVGo and also in the Molscript drawings of the structures shown in Figure 7Go. The sub-classification of the immunoglobulin-like domains found here are similar to, but not identical with, those found by Bork et al. (1994) using DALI. Figure 8Go shows the chain of neighbor relations that connects a lectin, which has a typical jelly-roll fold (Richardson, 1981Go), to a protein with an immoglobulin fold. It shows that a jelly-roll fold can be converted to the immunoglobulin fold simply by treating the innermost part of the jelly roll (the unshaded portion of the molecules in Figure 8Go) as an insertion to a structure of the immunoglobulin fold. The similarity between the structures of the diphteria toxin domain R (1ddt), which is sequentially homologous to 1sgk-6 and has the jelly-roll fold, and the immunoglobulins has been noted (Choe et al., 1992Go). Also, the two most similar structural homologues of 1ddt are 1nbc-A and 1msp-A according to DALI. 1nbc-A is a jelly-roll and a direct neighbor of 1ulo-0 according to SHEBA whereas 1msp-A has an immunoglobulin fold (Holm and Sander, 1998Go).



View larger version (47K):
[in this window]
[in a new window]
 
Fig. 7. A magnified view of the region that includes the largest cluster (2–186). The neighbor relations among the core members are shown as black dots and orphan members brown. The 10 largest sub-clusters are labeled. The Molscript drawings show the structure of an immunoglobulin Fv (e), a VCAM (f), an immunoglobulin Fc (g) and a fibronectin (h) domain.

 

View this table:
[in this window]
[in a new window]
 
Table IV. Four immunoglobulin sub-clusters
 


View larger version (28K):
[in this window]
[in a new window]
 
Fig. 8. Structures of proteins that form a chain of neighbor relations that connect a lectin molecule to a molecule with the immunoglobulin fold. Each protein is a direct neighbor of the proteins shown on either side. The proteins are, from the left, lectin (1ioaA0), cellulose degradation protein (1ulo-0), glycosidase (1bhgB1), diphtheria toxin domain R (1skg-6) and a cell motility protein (1mspB0). The five proteins were superimposed together using SHEBA and moved apart in the x-axis direction using GEMM. Molscript was used for the drawings. The matched parts of the structures are in color, blue for strands in the back sheet and green for those in the front sheet.

 
The second largest cluster contains mostly nucleotide binding oxidoreductases, but it also includes a sub-cluster which is made mainly of the second domain of the sugar-binding proteins. The identities of these latter proteins and all members of the sequentially homologous groups that they represent, are given in Table V(a)Go. The m-score and the number of sequence identities after alignment are given in Table V(b)Go. As can be seen and noted by others (Saper and Quiocho, 1983Go; Vyas et al., 1991Go; Pearl et al., 1994Go; Shilton et al., 1996Go), many of these proteins have little sequence homology to one another but are structurally homologous and have similar biochemical function.


View this table:
[in this window]
[in a new window]
 
Table V. A sub-cluster (950–956) in the dehydrogenase fold
 
Additional information

The SHEBA program package can be found at our web site (http://lmbbi.nci.nih.gov). The following additional information is also available at this web site: the scoring matrices H, S and E; the reference and SHEBA alignments for the 68 protein pairs of Fischer and Eisenberg (1996); and the sorted list of the 3539 protein domains with cluster boundaries.


    Discussion
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
This new procedure is unique among protein structure comparison methods in that it starts with an alignment that includes sequence homology and solvent exposure. When there are alternative alignments, a geometrically best one and another which has a poorer geometrical fit but which preserves more of the sequence homology, SHEBA may well choose the latter. The alignment of ferredoxins is an example of such a case. As described in the Results section, SHEBA matches only half of the protein 2fxb to the protein 5fd1, even though both halves can be matched geometrically (Figure 2Go). The missing half can also be aligned if the protein is circularly permuted. Also, there is another protein, 1blu, which aligns to 5fd1 with 43% sequence identity when circularly permuted in a similar manner (Table IGo, Figure 4Go). These observations strongly suggest that 1blu and 5fd1 evolved after one had been circularly permuted and that the alignments that SHEBA finds between these proteins and between 2fxb and 5fd1 are meaningful.

On the other hand, SHEBA is not overly dependent on sequence homology. An example is the case of the sugar-binding proteins. It is well known (Saper and Quiocho, 1983Go; Vyas et al., 1991Go; Pearl et al., 1994Go; Shilton et al., 1996Go) that some of these proteins have very low sequence homology to others in the group, yet have closely similar structures. SHEBA correctly groups these proteins into one sub-cluster with rather high m-scores between them (Table VGo).

The clustering was done at two levels. A main cluster consists of all proteins that are related to others in the cluster by a chain of neighbor relations. Sub-clusters are formed from core proteins, which are direct neighbors of each other. We did not attempt to build a more elaborate classification scheme and it is difficult to know how these main and sub-clusters relate to the hierarchical schemes such as SCOP (Murzin et al., 1995Go) and CATH (Orengo et al., 1993Go). The number of main clusters varies continuously with the zcutoff value (Figure 5Go), but the variation is mainly due to the change in the number of single-member clusters. The number of multi-member clusters is much less sensitive as long as the zcutoff value is >4. When the zcutoff value is set to 5, the total number of main clusters is 2043 and that of multi-member clusters is 336. These numbers are roughly comparable to the 1000 different folds estimated by Chothia (1992), 597 superfamilies and families in the October 1997 release of the SCOP database (Murzin et al., 1995Go) and the 1577 sequence families and 949 homologous superfamilies in CATH (Orengo et al., 1993Go). The data presented in Figure 5Go suggest that one main cause of the variation among these different estimates is the ambiguity in the classification of the single-member clusters. (Single members do not necessarily represent a single experimental structure since each is a representative of a set of structures which have high sequence homology.)

There were 27 781 protein pairs with z-scores >4.0. Of these, 10 379 pairs share <10% sequence identity. These protein structure pairs provide a valuable database, which can be used to study the sequence–structure relations among remotely homologous proteins.


    Notes
 
1 This work is part of the PhD research of Jongsun Jung at the Department of Chemistry, American University, 4400 Massachusetts Avenue, Washington, DC 20016, USA Back

2 To whom correspondence should be addressed. E-mail: bk{at}nih.gov Back


    Acknowledgments
 
We thank Dr Liisa Holm for a copy of the program PUU. We also thank Mr Joe J.Cammisa for his assistance with the computer system support and other members of the Laboratory for numerous discussions and criticisms. Jongsun Jung thanks Professors Frederick W.Carson, Lou Thompson Hughes and Nina M.Roscher of the American University for their encouragement.


    References
 Top
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 References
 
Alesker,V., Nussinov R. and Wolfson H.J. (1996) Protein Eng., 9, 1103–1119.[Abstract]

Alexandrov,N.N. and Fischer D. (1996) Proteins, 25, 354–365.[ISI][Medline]

Artymiuk,P.J., Rice D.W., Mitchell E.M. and Willett P. (1990) Protein Eng., 4, 39–43.[Abstract]

Bork,P., Holm L. and Sander C. (1994) J. Mol. Biol., 242, 309–320.[ISI][Medline]

Boutonnet,N.S., Rooman M.J., Ochagavia M.E., Richelle J. and Wodak S.J. (1995) Protein Eng., 8, 647–662.[Abstract]

Bowie,J.U., Luthy R. and Eisenberg D. (1991) Science, 253, 164–169.[ISI][Medline]

Branden,C. and Tooze J. (1999). Introduction to Protein Structure. Garland Publishing, New York, pp. 299–323

Choe,S., Bennett M.J., Fujii G., Curmi P.M., Kantardjieff K.A., Collier R.J. and Eisenberg D. (1992) Nature, 357, 216–222.[ISI][Medline]

Chothia,C. (1992) Nature, 357, 543–544.[ISI][Medline]

Fischer,D. and Eisenberg D. (1996) Protein Sci., 5, 947–955.[Abstract/Free Full Text]

Fischer,D., Bachar O., Nussinov R. and Wolfson H. (1992) J. Biomol. Struct. Dyn., 9, 769–789.[ISI][Medline]

Gerstein,M. and Levitt M. (1998) Protein Sci., 7, 445–456.[Abstract/Free Full Text]

Gibrat,J.F., Madej T. and Bryant S.H. (1996) Curr. Opin. Struct. Biol., 6, 377–385.[ISI][Medline]

Gonnet,G.H., Cohen M.A. and Benner S.A. (1992) Science, 256, 1443–1445.[ISI][Medline]

Hobohm,U. and Sander C. (1994) Protein Sci., 3, 522–524.[Abstract/Free Full Text]

Holm,L. and Sander C. (1993) J. Mol. Biol., 233, 123–138.[ISI][Medline]

Holm,L. and Sander C. (1994) Proteins, 19, 256–268.[ISI][Medline]

Holm,L. and Sander C. (1995) ISMB, 3, 179–187.[Medline]

Holm,L. and Sander C. (1998) Proteins, 33, 88–96.[ISI][Medline]

Kabsch,W. (1976) Acta Crystallogr., A32, 922–923.[ISI]

Kabsch,W. (1978) Acta Crystallogr., A34, 827–828.[ISI]

Kabsch,W. and Sander C. (1983) Biopolymers, 22, 2577–2637.[ISI][Medline]

Koch,I., Lengauer T. and Wanke E. (1996) J. Comput. Biol., 3, 289–306.[ISI][Medline]

Kraulis,P.J. (1991) J. Appl. Crystallogr., 24, 946–950.[ISI]

Lee,B. and Richards F.M. (1971) J. Mol. Biol., 55, 379–400.[ISI][Medline]

Madej,T., Gibrat J.F. and Bryant S.H. (1995) Proteins, 23, 356–369.[ISI][Medline]

Matthews,B. and Rossmann M. (1985) Methods Enzymol., 115, 397–420.[ISI][Medline]

Mizuguchi,K. and Go N. (1995) Protein Eng., 8, 353–362.[Abstract]

Murzin,A.G., Brenner S.E., Hubbard T. and Chothia C. (1995) J. Mol. Biol., 247, 546–540.

Needleman,S.B. and Wunsch C.D. (1970) J. Mol. Biol., 48, 443–453.[ISI][Medline]

Orengo,C.A., Flores T.P., Taylor W.R. and Thornton J.M. (1993) Protein Eng., 6, 485–500.[Abstract]

Pearl,L., O'Hara B., Drew R. and Wilson S. (1994) EMBO J., 13, 5810–5817.[Abstract]

Richardson,J.S. (1981) Adv. Protein Chem., 34, 167–339.[Medline]

Rufino,S.D. and Blundell T.L. (1994) J. Comput. Aided Mol. Des., 8, 5–27.[ISI][Medline]

Saper,M.A. and Quiocho F.A. (1983) J. Biol. Chem., 258, 11057–11062.[Abstract/Free Full Text]

Satow,Y., Cohen G.H., Padlan E.A. and Davies D.R. (1986) J. Mol. Biol., 190, 593–604.[ISI][Medline]

Shilton,B.H., Flocco M.M., Nilsson M. and Mowbray S.L. (1996) J. Mol. Biol., 264, 350–363.[ISI][Medline]

Shindyalov,I.N. and Bourne P.E. (1998) Protein Eng., 9, 739–747.

Shrake,A. and Rupley J.A. (1973) J. Mol. Biol., 79, 351–372.[ISI][Medline]

Suyama,M., Matsuo Y. and Nishikawa K. (1997) J. Mol. Evolut., 44, 163–173.

Syi,J.L. and Lee,B. (1988) J. Mol. Graph., 6, 226–226.

Taylor,W.R. and Orengo C.A. (1989) J. Mol. Biol., 208, 1–22.[ISI][Medline]

Vyas,N.K., Vyas M.N. and Quiocho F.A. (1991) J. Biol. Chem., 266, 5226–5237.[Abstract/Free Full Text]

ZuKang,F. and Sippl M.J. (1996) Fold. Des., 26, 451–461.

Zuker,M. and Somorjai R.L. (1989) Bull. Math. Biol., 51, 55–78.[ISI][Medline]

Received February 1, 2000; revised June 6, 2000; accepted June 10, 2000.