Max-Planck-Institut für Informatik, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany
1 To whom correspondence should be addressed. E-mail: doming{at}mpi-sb.mpg.de
Abstract |
---|
Keywords: alternative models/clustering/distance matrix/hierarchical clustering/structure classification
Introduction |
---|
It is now common that alternative experimental models of the same protein are available in the PDB. These models can display considerable structural differences as, in general, they correspond to different crystal forms, different physicochemical conditions, different mutants and proteins forming different complexes or binding to different ligands. Currently no database is available that classifies these alternative models according to their structure relationships. In the current SCOP release (1.65, December 2003), 75% of sets at the species level have at least two entries (two or more alternative structural models for a protein domain) and on average there are 6.7 entries per species level. The problem is becoming more relevant as the average number of alternative structures for a given domain is expected to increase.
In recent years, clustering techniques have been applied to classify protein structures. Clustering has been used to automate the classification of proteins into different folds and families and to analyze trajectories from molecular dynamics simulations; see, for example, May (1999), Laboulais et al. (2002) and Choi et al. (2004). The method of Carugo and Pongor (2002), in particular, compares the distributions of Cα coordinates between two proteins in order to estimate their structural similarity efficiently. Clustering techniques have also been applied to determine representatives for ensembles of NMR-derived structures (Kelley et al., 1996). In that approach, the all-atom root-mean-square deviation (rmsd) is used as a dissimilarity measure for hierarchical clustering, the models are grouped into clusters and a representative is given for each cluster (the model closest to the cluster centroid). The OLDERADO database (Kelley and Sutcliffe, 1997) provides the results for the NMR ensembles available in the PDB. So far, these methods have not been applied to clustering the alternative structural models (determined by protein crystallography or NMR spectroscopy) that are available in the PDB for each protein domain.
Here we propose STRuster, a method for clustering alternative structural models corresponding to different structure determination experiments. The structures are classified according to backbone structure similarity using Cα distance matrices. The dissimilarity measure used for clustering is based on the Euclidean distance for each pair of Cα coordinates. Filters are applied in order to render the method sensitive to a wide range of backbone conformational changes. The method has been applied to each SCOP species level and the results are available online. These results can be useful for guiding further structure determination experiments, in the design and interpretation of mutational experiments, in the selection of models for docking or in the selection of templates for structure prediction. More generally, the results can be useful in the selection of a non-redundant set of protein structures.
Materials and methods |
---|
The SCOP classification database, release 1.65 (December 2003), is used. Analysis is restricted to the first seven classes, the true classes. Therefore, we excluded coiled-coil proteins, low-resolution protein structures, peptides and designed proteins, which are not true classes. Of 7705 sets corresponding to the SCOP species level, 4187 with at least three domain structural models (entries) are clustered. For each set, entries are compared with the entry that has the largest number of residues. Only entries with at least 80% of the number of residues of the largest entry and at least 90% sequence identity to it are considered. The program ALIGN (Myers and Miller, 1989) is used to align the sequences and to determine the equivalent residues. The final number of species sets to be clustered is 3716, comprising a total of 36 531 entries. The ASTRAL SCOP 1.65 PDB-style files (Chandonia et al., 2004) are used as the source of the coordinates for each SCOP entry.
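The entry selection described above can be sketched as follows. This is only an illustration: the dictionary fields (`id`, `n_residues`, `identity`) and the function name `select_entries` are hypothetical, and the sequence identity to the largest entry is assumed to have been computed beforehand from the alignment.

```python
def select_entries(entries, min_length_fraction=0.8, min_identity=0.9):
    """Keep entries that are long enough and similar enough to the largest entry.

    Each entry is a dict with hypothetical keys:
      'n_residues' - number of residues in the entry
      'identity'   - sequence identity to the largest entry (0..1), precomputed
    """
    reference = max(entries, key=lambda e: e["n_residues"])
    min_length = min_length_fraction * reference["n_residues"]
    return [
        e for e in entries
        if e["n_residues"] >= min_length and e["identity"] >= min_identity
    ]


# Toy set: 'c' is too short and 'd' is too dissimilar to the largest entry 'a'.
toy_set = [
    {"id": "a", "n_residues": 100, "identity": 1.00},
    {"id": "b", "n_residues": 85, "identity": 0.95},
    {"id": "c", "n_residues": 70, "identity": 0.99},
    {"id": "d", "n_residues": 95, "identity": 0.50},
]
selected_ids = [e["id"] for e in select_entries(toy_set)]
```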
Dissimilarity measure
For each SCOP species set, the Cα distance matrices are calculated for all entries. Let $(x_i, y_i, z_i)$ be the Cα coordinates of residue i. The Euclidean distance between the Cα atoms of residues i and j in entry a is

$$D_{ij}(a) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}.$$

In order to reduce the influence of differences in large distances associated with extensive conformational changes, a first filter is applied with cut-off F1, resulting in D'ij(a):

$$D'_{ij}(a) = \begin{cases} D_{ij}(a) & \text{if } D_{ij}(a) \le F_1 \\ F_1 & \text{otherwise} \end{cases}$$

For each pair of entries a, b the absolute difference is then calculated for each residue pair: $\Delta_{ij}(a,b) = |D'_{ij}(a) - D'_{ij}(b)|$. Only residues that can be aligned to the largest entry in the set (used as reference) are considered. If one of the entries includes an insertion, the corresponding residues in the insertion are not considered. A second filter is then applied with cut-off F2 in order to restrict the analysis to significant structural differences:

$$\Delta'_{ij}(a,b) = \begin{cases} 1 & \text{if } \Delta_{ij}(a,b) > F_2 \\ 0 & \text{otherwise} \end{cases}$$

In this study, we set F1 = 14.0 Å and F2 = 1.0 Å. The matrix M contains the dissimilarity values of all pairs involving the N entries in the set, where M(a,b) corresponds to the dissimilarity between entries a and b with L aligned residues:

$$M(a,b) = \sum_{i=1}^{L} \sum_{j=1}^{L} \Delta'_{ij}(a,b)$$
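A minimal sketch of this dissimilarity measure is given below. It reads the first filter as capping every distance at F1 and the second filter as counting the residue pairs whose absolute difference exceeds F2; both readings are assumptions of this sketch, chosen to be consistent with a normalized dissimilarity that lies between 0 and 1. The function names and the toy coordinates are illustrative and not part of STRuster.

```python
import math

F1 = 14.0  # cut-off for the first filter, value used in this study
F2 = 1.0   # cut-off for the second filter, value used in this study


def filtered_distance_matrix(coords, f1=F1):
    """C-alpha distance matrix with every distance capped at f1 (assumed filter)."""
    return [[min(math.dist(p, q), f1) for q in coords] for p in coords]


def dissimilarity(coords_a, coords_b, f1=F1, f2=F2):
    """Count residue pairs (i, j) whose filtered distances differ by more than f2.

    coords_a and coords_b must list the same L aligned residues in the same order.
    """
    da = filtered_distance_matrix(coords_a, f1)
    db = filtered_distance_matrix(coords_b, f1)
    size = len(da)
    return sum(
        1
        for i in range(size)
        for j in range(size)
        if abs(da[i][j] - db[i][j]) > f2
    )


# Toy example: three aligned residues; the third one is displaced in entry b,
# which changes its distance to the first residue but not to the second.
entry_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
entry_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
m_ab = dissimilarity(entry_a, entry_b)
r_ab = m_ab / len(entry_a) ** 2  # normalized dissimilarity M(a,b) / L^2
```

In the toy example only the pair formed by the first and third residues changes by more than F2, and it is counted once for (i, j) and once for (j, i).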
Clustering
The R programming environment for data analysis (version 1.8.1) is used for clustering (http://www.r-project.org/). The dissimilarity matrix M is used as input for two alternative clustering methods. The first is a hierarchical method, using group average agglomeration (Gordon, 1999). Dendrograms are generated for visualization of the hierarchical dependencies in the data. The second method, partitioning around medoids (PAM), is applied in order to obtain the optimal number of clusters where the entries are grouped in a robust way (Kaufman and Rousseeuw, 1990; Struyf et al., 1997).
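To illustrate group average agglomeration, the following plain Python sketch merges, at each step, the two clusters with the smallest mean pairwise dissimilarity. It is not the R code used in this work; the function name `average_linkage` and the toy matrix are assumptions made for the example.

```python
def average_linkage(m):
    """Agglomerative clustering with group-average linkage.

    m is a symmetric dissimilarity matrix (list of lists).
    Returns the merge history as (members_1, members_2, height) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(m))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Group average: mean dissimilarity over all cross-cluster pairs.
                total = sum(m[a][b] for a in clusters[x] for b in clusters[y])
                d = total / (len(clusters[x]) * len(clusters[y]))
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)]
        clusters.append(merged)
    return merges


# Toy dissimilarity matrix with two obvious groups: {0, 1} and {2, 3}.
toy_m = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
history = average_linkage(toy_m)
heights = [h for _, _, h in history]
```

The large jump between the last two merge heights in the toy example is what appears in a dendrogram as two well-separated clusters.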
PAM is a partitioning algorithm and can be regarded as a generalization of K-means clustering to arbitrary dissimilarity matrices. The goal is to minimize the objective function $\sum_{i=1}^{N} \min_{t=1,\dots,k} M(a_i, m_t)$, where the sum is taken over all entries a1,...,aN in the protein set and m1,...,mk are k appropriately chosen representatives (medoids) from the set. The algorithm consists of two steps. In the BUILD step, k initial medoids are sequentially selected. In the SWAP step, the objective function is minimized iteratively by replacing one medoid with another entry. This step is repeated until convergence.
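The two steps can be sketched as follows. This is a simplified illustration rather than the implementation used in R: BUILD is written as a greedy selection that lowers the objective as much as possible at each step, and SWAP accepts the first improving exchange instead of searching for the best one. The names `objective` and `pam` and the toy matrix are assumptions of the sketch.

```python
def objective(m, medoids):
    """Sum over all entries of the dissimilarity to the nearest medoid."""
    return sum(min(row[t] for t in medoids) for row in m)


def pam(m, k):
    """Simplified partitioning around medoids on a dissimilarity matrix m."""
    n = len(m)
    # BUILD: sequentially add the entry that lowers the objective the most.
    medoids = []
    for _ in range(k):
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(min(candidates, key=lambda c: objective(m, medoids + [c])))
    # SWAP: exchange a medoid for a non-medoid while the objective decreases.
    improved = True
    while improved:
        improved = False
        current = objective(m, medoids)
        for pos in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:pos] + [c] + medoids[pos + 1:]
                value = objective(m, trial)
                if value < current:
                    medoids, current, improved = trial, value, True
    # Assign every entry to the cluster of its nearest medoid.
    labels = [min(range(k), key=lambda t: m[a][medoids[t]]) for a in range(n)]
    return medoids, labels


# Toy dissimilarity matrix with two obvious groups: {0, 1} and {2, 3}.
pam_m = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
pam_medoids, pam_labels = pam(pam_m, 2)
```

Because every accepted swap strictly lowers the objective and the number of medoid sets is finite, the loop terminates.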
The silhouette width value is a measure of cluster validity (Rousseeuw, 1987) and is used to select the best number of clusters obtained with the PAM algorithm. Assume that we have a clustering of N protein entries into k clusters, such that an entry a belongs to cluster C of size r. The average dissimilarity between a and all other entries in cluster C is

$$d(a,C) = \frac{1}{r-1} \sum_{b \in C,\, b \neq a} M(a,b).$$

Following Rousseeuw (1987), the average dissimilarity between a and the entries of any other cluster E is

$$d(a,E) = \frac{1}{|E|} \sum_{b \in E} M(a,b),$$

the dissimilarity to the closest other cluster is

$$d'(a) = \min_{E \neq C} d(a,E)$$

and the silhouette value of a is

$$s(a) = \frac{d'(a) - d(a,C)}{\max\{d(a,C),\, d'(a)\}}.$$

Silhouette values lie in the range [−1, 1]. Entries with a silhouette value s(a) close to 1.0 are well clustered, in the sense that the average distance to entries in the same cluster is small compared with the average distance to the closest other cluster. If the silhouette value is smaller than 0, the entry is misclassified. PAM clustering is applied for all numbers of clusters k between 1 and N − 1 and the corresponding average silhouette values $\bar{s}(k)$ are calculated. The best clustering corresponds to k* number of clusters:

$$k^* = \arg\max_{k} \bar{s}(k).$$
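The selection of the number of clusters by average silhouette width can be sketched as below, following the standard definition of Rousseeuw (1987). The sketch assumes that cluster labels for each candidate k are already available (for example from PAM), that at least two clusters are present, and that an entry alone in its cluster receives a silhouette value of 0; the function names and the toy data are illustrative.

```python
def silhouette(m, labels, a):
    """Silhouette value of entry a for the clustering given by labels."""
    n = len(m)
    own = [b for b in range(n) if labels[b] == labels[a] and b != a]
    if not own:
        return 0.0  # assumed convention for an entry alone in its cluster
    within = sum(m[a][b] for b in own) / len(own)
    other_clusters = set(labels) - {labels[a]}
    # Average dissimilarity to the closest other cluster.
    nearest = min(
        sum(m[a][b] for b in range(n) if labels[b] == e) / labels.count(e)
        for e in other_clusters
    )
    denominator = max(within, nearest)
    return (nearest - within) / denominator if denominator > 0 else 0.0


def average_silhouette(m, labels):
    """Average silhouette width over all entries."""
    return sum(silhouette(m, labels, a) for a in range(len(m))) / len(m)


def best_number_of_clusters(m, clusterings):
    """clusterings maps k to a label list; return the k with the highest average."""
    return max(clusterings, key=lambda k: average_silhouette(m, clusterings[k]))


# Toy dissimilarity matrix with two obvious groups: {0, 1} and {2, 3}.
sil_m = [
    [0, 1, 8, 9],
    [1, 0, 7, 8],
    [8, 7, 0, 2],
    [9, 8, 2, 0],
]
toy_clusterings = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
k_star = best_number_of_clusters(sil_m, toy_clusterings)
```

In the toy data, splitting the second group into two singletons lowers the average silhouette width, so the two-cluster solution is preferred.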
Results |
---|
Distribution of set size and dissimilarity values
Figure 1 shows the histogram of sizes of the protein sets that have been clustered. Most of the sets (76%) have ≤10 entries. The percentage of sets with >40 entries is small (2%). The largest set, with 309 entries, results from the extensive protein engineering work on bacteriophage T4 lysozyme, with SCOP unique identifier (sunid) 53983. All these sets have been clustered and the results are available online at http://bioinf.mpi-sb.mpg.de/projects/struster/.
The rmsd after rigid-body superposition is a popular measure for expressing the structure similarity between proteins. Let $d_i$ be the distance between the i-th pair of equivalent atoms in two optimally superposed structures. The rmsd over n equivalent atoms is defined as

$$\mathrm{rmsd} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_i^2}$$
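As a small worked example of this definition, the sketch below computes the rmsd for two coordinate sets that are assumed to be already optimally superposed and listed in the same atom order; it does not perform the rigid-body superposition itself. The function name and the toy coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """rmsd over equivalent atoms of two already superposed structures."""
    squared = [math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))


# Two equivalent atoms: the first differs by 2 units, the second is identical.
superposed_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
superposed_b = [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)]
toy_rmsd = rmsd(superposed_a, superposed_b)  # sqrt((4 + 0) / 2)
```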
In the present work, the structural models are clustered based on a measure of dissimilarity M(a,b) between structures. This measure is sensitive to both large and small (but still significant) backbone conformational differences. It reflects the extent of significant differences (>1 Å) in short- to medium-range intramolecular distances between the two structures a and b. The value for normalized dissimilarity $R(a,b) = M(a,b)/L^2$ is easier to interpret. It is independent of the number of equivalent residues L in the set and its value lies in the interval between 0 (identity) and 1 (maximum difference).
Figure 2 shows the relationship between rmsd and both the dissimilarity measure M(a,b) (Figure 2A) and the normalized dissimilarity R(a,b) (Figure 2B). There is considerable positive correlation in both cases (0.813 for A and 0.815 for B).
D-2-Deoxyribose-5-phosphate aldolase
The structure models for Escherichia coli D-2-deoxyribose-5-phosphate aldolase provide a first example of the clustering analysis. Figure 3 shows the structural superposition of the six structures found in the corresponding SCOP species level (sunid 69395) and the dendrogram obtained from hierarchical clustering. The models present small structural differences. Nevertheless two clusters, C1 and C2, can be distinguished in the dendrogram. The same two clusters are produced by PAM clustering, with an average silhouette width of 1.0, indicating two well-differentiated clusters. Larger numbers of clusters produce lower average silhouette width values.
Serum transferrin
Transferrins are responsible for sequestering and solubilizing iron. In particular, serum transferrin binds Fe(III) in the blood and transports it to cells, where it is released at low pH into the endosome. The iron-free apotransferrin is then recycled back to circulation. Vertebrate transferrins consist of a single polypeptide chain with a twofold internal repeat, resulting in two homologous lobes (N- and C-lobes). The lobes contain similar iron-binding sites located in a deep cleft between two α/β subdomains. A single Fe(III) is bound in this cleft to four amino acid side chains and to a carbonate ion. The SCOP species level of human serum transferrin (sunid 53899) includes 20 entries for the N-lobe.
Figure 4 shows the corresponding dendrogram and the structure comparison of the different models. Two major clusters separated by a large dissimilarity value are visible in the dendrogram, C1 and C2 (Figure 4A). In fact, these two clusters correspond to the iron-free apo form (C1) and to the iron-binding holo form of transferrin (C2). The apo form is transformed into the holo form by a large (63°) rigid-body rotation of one of the α/β subdomains relative to the other, resulting in the opening of the iron-binding cleft (see Figure 4B).
Within each of these clusters (C1, C21 and C22), the backbone differences are small and the different models correspond to single residue mutants, different crystallization conditions or different expression systems. In particular, within the C1 cluster, one can observe two subclusters with low dissimilarity. They match almost identical structures (1btj and 1bp5 PDB entries) derived from two closely related crystal forms (Jeffrey et al., 1998).
Figure 5 shows the values of average silhouette width for PAM clustering from one to 19 clusters. The best clustering is achieved with three clusters, with a high average silhouette width (0.939) indicating a clear separation between the clusters. The three clusters correspond to C1, C21 and C22. The representatives are d1bp5d_, d1a8f__ and d1n84a_ for C1, C21 and C22, respectively.
The examples so far demonstrate ensembles which can be clearly classified into different types of structures. It is also possible that the structural neighborhood of a model (the other closely related models) is less clearly defined. This is the case when the structural differences in the set are more continuous or when the structural neighborhood of a given model varies along the polypeptide chain. These cases are associated with a lower silhouette width. The clustering of glucose dehydrogenase structures from Bacillus megaterium provides such an example. The protein consists of a tetramer with four identical subunits. The corresponding SCOP species level (sunid 51785) includes four sequence-identical entries from the same PDB file (Yamamoto et al., 2001).
The Cα conformation is very similar in these four entries. The only significant difference is located in a flexible surface loop (Arg39–Asp44) (see Figure 6A). Models d1gcoe_ and d1gcof_ (with a very similar backbone conformation) differ from the other models in this region. In particular, the differences between d1gcoe_ and d1gcof_ on the one hand and d1gcoa_ on the other are clear for residues 41 and 44, but the respective differences to d1gcob_ are only significant for residue 44. From the structural comparison it is clear that d1gcoe_ and d1gcof_ should belong to the same cluster and d1gcoa_ to a different one, whereas d1gcob_ is an intermediate case.
Discussion |
---|
The current method only takes into account the Cα conformations. Including the coordinates of the side-chain atoms would make the procedure more sensitive to small structural changes, which might be important for more detailed structural analysis. In addition, we will also investigate how to include temperature factors and occupancy information in the method.
Acknowledgments |
---|
References |
---|
Carugo,O. and Pongor,S. (2002) J. Mol. Biol., 315, 887–898.
Chandonia,J.M., Hon,G., Walker,N.S., Lo Conte,L., Koehl,P., Levitt,M. and Brenner,S.E. (2004) Nucleic Acids Res., 32, D189–D192.
Choi,I.G., Kwon,J. and Kim,S.H. (2004) Proc. Natl Acad. Sci. USA, 101, 3797–3802.
Gordon,A.D. (1999) Classification, 2nd edn. Chapman and Hall, London.
Heine,A., DeSantis,G., Luz,J.G., Mitchell,M., Wong,C.H. and Wilson,I.A. (2001) Science, 294, 369–374.
Holm,L.L. and Sander,C. (1996) Science, 273, 595–602.
Jeffrey,P.D., Bewley,M.C., MacGillivray,R.T., Mason,A.B., Woodworth,R.C. and Baker,E.N. (1998) Biochemistry, 37, 13978–13986.
Kaufman,L. and Rousseeuw,P.J. (1990) Finding Groups in Data: an Introduction to Cluster Analysis. Wiley-Interscience, New York.
Kelley,L.A. and Sutcliffe,M.J. (1997) Protein Sci., 6, 2628–2630.
Kelley,L.A., Gardner,S.P. and Sutcliffe,M.J. (1996) Protein Eng., 9, 1063–1065.
Laboulais,C., Ouali,M., Le Bret,M. and Gabarro-Arpa,J. (2002) Proteins, 47, 169–179.
MacGillivray,R.T. et al. (1998) Biochemistry, 37, 7919–7928.
May,A.C. (1999) Proteins, 37, 20–29.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Myers,E.W. and Miller,W. (1989) Comput. Appl. Biosci., 4, 11–17.
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.
Rousseeuw,P. (1987) J. Comput. Appl. Math., 20, 53–65.
Sayle,R. and Milner-White,E. (1995) Trends Biochem. Sci., 20, 374.
Shatsky,M., Nussinov,R. and Wolfson,H.J. (2002) In Guigó,R. and Gusfield,D. (eds), Proceedings of the 2nd Workshop on Algorithms in Bioinformatics (WABI). Springer, Berlin, pp. 235–250.
Struyf,A., Hubert,M. and Rousseeuw,P.J. (1997) Comput. Stat. Data Anal., 26, 17–37.
Yamamoto,K., Kurisu,G., Kusunoki,M., Tabata,S., Urabe,I. and Osaki,S. (2001) J. Biochem. (Tokyo), 129, 303–312.
Received June 18, 2004; revised August 3, 2004; Edited by Andrej Sali