SMoS: a database of structural motifs of protein superfamilies
Saikat Chakrabarti1, K. Venkatramanan2 and R. Sowdhamini1,3
1National Centre for Biological Sciences (NCBS), Bangalore 560 065, India and
2Centre for Biotechnology, Anna University, Chennai 600025, India
3 To whom correspondence should be addressed. e-mail: mini{at}ncbs.res.in
Abstract
The Structural Motifs of Superfamilies (SMoS) database provides information about the structural motifs of aligned protein domain superfamilies. Motifs are recognized across the structurally aligned members of each superfamily by the conservation of amino acid preference and solvent inaccessibility, and are then examined for the conservation of other features such as secondary structural content, hydrogen bonding, non-polar interaction and residue packing. These motifs, along with their sequence and spatial orientation, represent the conserved core structure of each superfamily and provide the minimal sequence and structural information required to retain each superfamily fold.
Introduction
The superfamily is a level of hierarchical classification that contains proteins of different families and subfamilies having similar structure and function. These proteins might have very low sequence identities but retain the same fold through well conserved secondary structural elements. On the basis of the conservation of criteria such as amino acid preference and solvent accessibility, several conserved segments of proteins belonging to the same superfamily have been identified. These segments are termed structural motifs. The motifs, along with their sequence, spatial orientation and preservation of various structural criteria, represent the conserved core of each superfamily. The structural features of such motifs for several superfamilies are integrated into the Structural Motifs of Superfamilies (SMoS) database. The definition of superfamilies is in direct correspondence with SCOP (Murzin et al., 1995). One of the main purposes of the SMoS database is to provide important sequence segments that can be projected as the minimum structural requirements for a new member to be considered part of a pre-existing superfamily. Such motifs can also be employed to design and rationalize protein engineering and folding experiments.
Description
Aligned sequences of superfamilies have been obtained from the CAMPASS (Sowdhamini et al., 1998) and PASS2 (Mallika et al., 2002) databases, where COMPARER (Sali and Blundell, 1990) has been employed to derive structure-based sequence alignments. An automated method has been developed that uses conservation of solvent inaccessibility and amino acid exchange to extract the segments of the alignment that form the conserved core in all the protein members of the superfamily. Structural templates are identified by the presence of at least three consecutive solvent-buried residues that have high amino acid exchange scores. The structural motifs identified in this way are mapped onto the alignment of the multiple-member protein superfamilies. Interactive three-dimensional views of the motifs on the superposed structures of the superfamily members are displayed for better understanding and visualization. Inter-motif distances are defined as the average inter-Cβ distances between the equivalent amino acid residues of two motifs (in the case of glycine, the side chain hydrogen is replaced by a virtual Cβ), and an all-to-all matrix of these distances is employed to represent the motifs in the form of a dendrogram. This tree representation indicates the extent of structural proximity between the motifs.
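The template criterion described above (at least three consecutive buried positions with high exchange scores) can be illustrated with a minimal sketch. This is not the SMoS implementation: the function name `find_motifs`, the per-position inputs and the numerical cutoff `score_cutoff` are assumptions made for illustration, since the actual threshold on the exchange score is not stated here.

```python
from typing import List, Tuple


def find_motifs(buried: List[bool], exchange: List[float],
                score_cutoff: float, min_len: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) alignment index pairs, end exclusive.

    buried[i]   -- True if alignment position i is solvent-buried in the
                   superfamily members (how this is decided is left open).
    exchange[i] -- amino acid exchange score at position i.
    score_cutoff is a hypothetical threshold, not a published value.
    """
    if len(buried) != len(exchange):
        raise ValueError("per-position inputs must have equal length")
    motifs: List[Tuple[int, int]] = []
    run_start = None
    for i, (is_buried, score) in enumerate(zip(buried, exchange)):
        qualifies = is_buried and score > score_cutoff
        if qualifies and run_start is None:
            run_start = i                      # a new run begins here
        elif not qualifies and run_start is not None:
            if i - run_start >= min_len:       # keep runs of >= min_len
                motifs.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the alignment.
    if run_start is not None and len(buried) - run_start >= min_len:
        motifs.append((run_start, len(buried)))
    return motifs


# Toy alignment of 12 positions (invented values).
buried = [False, True, True, True, True, False,
          True, True, False, True, True, True]
exchange = [0.1, 0.9, 0.8, 0.7, 0.2, 0.9,
            0.9, 0.9, 0.9, 0.6, 0.7, 0.8]
motifs = find_motifs(buried, exchange, score_cutoff=0.5)
print(motifs)
```

In this toy input, the run of two qualifying positions (6 and 7) is discarded because it is shorter than three residues, while the two runs of three are retained.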
Further structural parameters, such as secondary structural content, hydrogen bonding, non-polar interaction and residue packing (Ooi number; Nishikawa and Ooi, 1986), are examined among the structurally aligned members of each protein superfamily. The motifs are also ranked on the basis of conservation of all these criteria (Figure 1). A structural feature is considered conserved at an alignment position if it is present in all or all-but-one members of a superfamily. The average conservation score, considering all six structural features, has been calculated and is represented in a graphical format. The extent of conservation of the structural features is also compared between the identified motif regions and the rest of the protein.
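The "all or all-but-one" rule can be written down in a few lines. The sketch below is illustrative only: the feature names, the boolean encoding (one flag per member per alignment position) and the choice to report the fraction of conserved features per position are assumptions, because the exact form of the average conservation score is not specified here.

```python
from typing import Dict, List


def is_conserved(flags: List[bool]) -> bool:
    """True if the feature is present in all or all-but-one members."""
    if len(flags) < 2:
        raise ValueError("need at least two aligned members")
    return sum(flags) >= len(flags) - 1


def conservation_scores(features: Dict[str, List[List[bool]]]) -> List[float]:
    """Fraction of features conserved at each alignment position.

    features maps a feature name to a positions x members matrix of flags.
    Averaging over features is an assumed scoring scheme.
    """
    matrices = list(features.values())
    n_positions = len(matrices[0])
    if any(len(m) != n_positions for m in matrices):
        raise ValueError("all features must cover the same positions")
    scores = []
    for pos in range(n_positions):
        n_conserved = sum(is_conserved(m[pos]) for m in matrices)
        scores.append(n_conserved / len(matrices))
    return scores


T, F = True, False
# Two placeholder features, three positions, four superfamily members.
features = {
    "solvent_buried": [[T, T, T, T], [T, T, F, T], [T, F, F, T]],
    "hydrogen_bonded": [[T, T, T, F], [F, F, T, T], [T, T, T, T]],
}
scores = conservation_scores(features)
print(scores)
```

With six features instead of two, the same function would give a per-position score in steps of one sixth, which could then be averaged over the positions of a motif to rank it.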
Despite similar topology or conservation of residues, evolutionary divergence and poor sequence identity amongst superfamily members are often reflected as differences in the orientations and positions of individual structural motifs. The motif regions are transformed into a vector representation by the least-squares fit method (Chou et al., 1984; Srinivasan et al., 1991) (Figure 2). Spatial distances and virtual torsion angles for all the motifs are calculated and represented in the form of a matrix. The average distance and absolute angle for each motif vector with respect to the centre of mass of the protein are also listed for each superfamily. Average deviations of up to 3 Å in the virtual distances and 40° in the virtual torsions are observed, and tolerated, within most superfamilies in the current data set. The average volume and depth of the motif-surrounded regions are calculated for all the proteins of each superfamily (depth is defined as the average distance from the extreme surface points of the protein). These values are provided, together with the volume of these regions expressed as a percentage of the volume of the whole protein.
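As a rough illustration of the vector representation, the sketch below fits a least-squares axis through the points of a motif and then computes a distance and a virtual torsion between two such axes. Several choices are assumptions rather than the published procedure: the axis is obtained as the principal direction of the centred coordinates (by power iteration), the distance is measured between axis centroids, and the torsion is taken as the dihedral over the four points (centroid A + direction A, centroid A, centroid B, centroid B + direction B). All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def add(a: Vec, b: Vec) -> Vec:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def scale(a: Vec, s: float) -> Vec:
    return (a[0] * s, a[1] * s, a[2] * s)


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a: Vec) -> Vec:
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def fit_axis(points: Sequence[Vec], iterations: int = 50) -> Tuple[Vec, Vec]:
    """Least-squares line through points: returns (centroid, unit direction).

    Assumes at least two points with distinct first and last coordinates.
    """
    n = len(points)
    centroid = (sum(p[0] for p in points) / n,
                sum(p[1] for p in points) / n,
                sum(p[2] for p in points) / n)
    centred = [sub(p, centroid) for p in points]
    chain = sub(points[-1], points[0])
    direction = unit(chain)            # starting guess: end-to-end vector
    for _ in range(iterations):        # power iteration on the scatter matrix
        acc = (0.0, 0.0, 0.0)
        for c in centred:
            acc = add(acc, scale(c, dot(c, direction)))
        direction = unit(acc)
    if dot(direction, chain) < 0:      # orient along the chain direction
        direction = scale(direction, -1.0)
    return centroid, direction


def dihedral(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b0, b1, b2 = sub(p0, p1), unit(sub(p2, p1)), sub(p3, p2)
    v = sub(b0, scale(b1, dot(b0, b1)))   # b0 projected off the p1-p2 bond
    w = sub(b2, scale(b1, dot(b2, b1)))   # b2 projected off the p1-p2 bond
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))


def virtual_torsion(axis_a: Tuple[Vec, Vec], axis_b: Tuple[Vec, Vec]) -> float:
    (ca, da), (cb, db) = axis_a, axis_b
    return dihedral(add(ca, da), ca, cb, add(cb, db))


# Two idealised motifs (invented coordinates, in Å).
motif_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
motif_b = [(1.5, -1.5, 5.0), (1.5, -0.5, 5.0), (1.5, 0.5, 5.0), (1.5, 1.5, 5.0)]
axis_a, axis_b = fit_axis(motif_a), fit_axis(motif_b)
separation = math.sqrt(dot(sub(axis_b[0], axis_a[0]), sub(axis_b[0], axis_a[0])))
torsion = virtual_torsion(axis_a, axis_b)
print(round(separation, 3), round(torsion, 1))
```

Repeating the two calculations for every pair of motifs would give matrices of virtual distances and torsions, against which deviations such as the 3 Å and 40° figures quoted above could be assessed.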
Applications
Such motifs are useful because they are conserved sequence patterns that assist in the identification of further members of an existing superfamily. Motifs derived from 12 superfamilies, when scanned against non-redundant sequence databases, successfully recognized 104 uncharacterized or hypothetical proteins that are distantly related to known superfamilies of proteins and could not be detected by other sensitive procedures (S.Chakrabarti and R.Sowdhamini, unpublished data). For example, the connection between a hypothetical protein and members of the cysteine hydrolase superfamily could be identified using this approach despite a poor sequence identity of 15%. Such structural templates or motifs provide constraints that are complementary to functional motifs obtained from various resources. The use of spatial restraints derived from structural templates also results in more accurate three-dimensional models from homology modelling when the query is only distantly related to any of the structural homologues; details are given elsewhere (S.Chakrabarti, J.John and R.Sowdhamini, submitted for publication). This strategy can be employed, in general, to overcome the inherent limitation of comparative modelling methods when using multiple distantly related templates.
Discussion
Structural motifs can be used as sequence signatures for proteins belonging to a similar functional class at the superfamily level of classification. These conserved regions can therefore be utilized to identify and classify similar sequences within a superfamily of proteins. The objective definition of a structural motif is somewhat context dependent. We have used the conservation of structural features such as amino acid sequence similarity and solvent burial as the primary requirement for the identification of structural motifs, since they represent the core of proteins. This has been emphasized even for homologous families (Zvelebil et al., 1987; Overington et al., 1990). However, the conservation of other structural criteria is not critical and is therefore not treated as a determinant in the objective identification of motifs.
The availability of a web resource for structural motifs of superfamilies is valuable since evolutionary divergence makes it impossible to derive conserved sequence segments simply by residue conservation. Identification and projection of structure-based motifs mapped on alignments will be useful for improving alignments and for building better three-dimensional models involving distant relationships. This is a natural follow-up to alignments of distantly related proteins that can be grouped into superfamilies (Sowdhamini et al., 1998). Structural motifs provided in the SMoS database have important applications in sequence searches, sequence alignments and distant homology modelling. They can also help to rationalize and design mutation experiments in proteins.
Availability
The SMoS database is accessible via http://www.ncbs.res.in/faculty/mini/SMoS/index.htm
References
Chou,K.C., Némethy,G. and Scheraga,H.A. (1984) J. Am. Chem. Soc., 106, 3161–3170.
Felsenstein,J. (1997) Syst. Biol., 46, 101–111.
Mallika,V., Bhaduri,A. and Sowdhamini,R. (2002) Nucleic Acids Res., 30, 284–288.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Nishikawa,K. and Ooi,T. (1986) J. Biochem., 100, 1043–1047.
Overington,J.P., Johnson,M.S., Sali,A. and Blundell,T.L. (1990) Proc. R. Soc. Lond. B Biol. Sci., 241, 132–145.
Sali,A. and Blundell,T.L. (1990) J. Mol. Biol., 212, 403–428.
Sayle,R.A. and Milner-White,E.J. (1995) Trends Biochem. Sci., 20, 374–375.
Sowdhamini,R., Burke,D.F., Huang,J.F., Mizuguchi,K., Nagarajaram,H.A., Srinivasan,N., Steward,R.E. and Blundell,T.L. (1998) Structure, 6, 1087–1094.
Srinivasan,N., Sowdhamini,R., Ramakrishnan,C. and Balaram,P. (1991) In Balaram,P. and Ramaseshan,S. (eds), Molecular Conformation and Biological Interactions. Indian Academy of Sciences, Bangalore, pp. 59–73.
Zvelebil,M.J., Barton,G.J., Taylor,W.R. and Sternberg,M.J. (1987) J. Mol. Biol., 195, 957–961.
Received April 25, 2003; revised September 9, 2003; accepted September 24, 2003.