The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues

J.E. Bray1,2, A.E. Todd1, F.M.G. Pearl1, J.M. Thornton1,3 and C.A. Orengo1

1 Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK, 3 Department of Crystallography, Birkbeck College, Malet Street, London WC1E 7HX, UK


    Abstract
 
A consensus approach has been developed for identifying distant structural homologues. This is based on the CATH Dictionary of Homologous Superfamilies (DHS), a database of validated multiple structural alignments annotated with consensus functional information for evolutionary protein superfamilies (URL: http://www.biochem.ucl.ac.uk/bsm/dhs). Multiple structural alignments have been generated for 362 well-populated superfamilies in the CATH structural domain database and annotated with secondary structure, physicochemical properties, functional sequence patterns and protein–ligand interaction data. Consensus functional information for each superfamily includes descriptions and keywords extracted from SWISS-PROT and the ENZYME database. The Dictionary provides a powerful resource to validate, examine and visualize key structural and functional features of each homologous superfamily. The value of the DHS, for assessing functional variability and identifying distant evolutionary relationships, is illustrated using the pyridoxal-5'-phosphate (PLP) binding aspartate aminotransferase superfamily. The DHS also provides a tool for examining sequence–structure relationships for proteins within each fold group.

Keywords: CATH database/functional annotation/homologous superfamily/protein domains/structural alignments


    Introduction
 
The Protein Data Bank (PDB), jointly managed by the Research Collaboratory for Structural Bioinformatics (RCSB) and the European Bioinformatics Institute (EBI), contains over 9000 protein chains consisting of over 18 000 domains determined by X-ray crystallography or NMR (Abola et al., 1987; URL: http://www.rcsb.org/pdb). With 200 new structures being deposited each month and structural genomic projects gathering momentum (Pennisi, 1998), it becomes increasingly important that the structural data are organized and annotated in a biologically meaningful and useful way.

Protein structure classification databases, CATH (Orengo et al., 1997), SCOP (Murzin et al., 1995), Dali Domain Dictionary (Holm and Sander, 1999), 3Dee (Siddiqui and Barton, 1997), DDBASE (Sowdhamini et al., 1996) and ENTREZ/MMDB (Hogue et al., 1996; Marchler-Bauer et al., 1999) are now well established and provide frameworks for ordering the known protein universe. These databases cluster proteins into fold groups or evolutionary families using manual methods (Murzin et al., 1995) or automatic structure comparison methods, i.e. SSAP (Taylor and Orengo, 1989), DALI (Holm and Sander, 1993), STAMP (Russell and Barton, 1992), DIAL (Sowdhamini and Blundell, 1995) and VAST (Gibrat et al., 1997). There is no consensus definition of fold similarity, and different groups apply different criteria for assigning proteins to fold groups (reviewed in Orengo, 1994, and Brown et al., 1996). More recently, databases have emerged that present structural alignments for selected protein families. HOMSTRAD (Mizuguchi et al., 1998a) contains 372 alignments for homologous families originally developed by Overington et al. (1990). CAMPASS (Sowdhamini et al., 1998) is a database of 52 structurally aligned protein superfamilies derived from DDBASE. In both cases the alignments are annotated with structural features using JOY (Mizuguchi et al., 1998b).

CATH and SCOP are currently the largest manually validated hierarchical classifications of protein domain structures (Hubbard et al., 1999; Orengo et al., 1999). Both of these databases group proteins into homologous families and superfamilies. These levels are interesting to many biologists as they cluster proteins that have descended from a common ancestral gene and whose core structural and functional features have often been conserved by evolution (Chothia, 1984; Overington et al., 1990). Identifying the homologous superfamily of a protein of interest can be an important step in determining the biological role of the protein. Proteins are called analogues when they share a common fold but there is no definitive evidence that they are related by evolution. This has been observed to occur in some highly populated fold groups described as superfolds (Orengo et al., 1994) or frequently occurring domains (FODs, Brenner et al., 1997). These folds are particularly common in nature, possibly owing to the inherent thermodynamic stability of the fold and the prevalence of common recurring structural motifs (Salem et al., 1999).

Homologous relationships can be identified relatively easily if the divergent proteins retain high sequence identities of 30% or more (Chothia and Lesk, 1986; Sander and Schneider, 1991; Rost, 1999). The presence of a functional sequence motif (PROSITE, Hofmann et al., 1999) or set of motifs (PRINTS, Attwood et al., 1999) can be used to detect more distant sequence relationships. More recently, sequence searching methods that use profile-based approaches or intermediate sequences, e.g. PSI-BLAST (Altschul et al., 1997), hidden Markov models (SAM-T98, Hughey and Krogh, 1996), ISS (Park et al., 1997) and MISS (Salamov et al., 1999), have also been shown to detect more distant homologues than pairwise sequence techniques (Park et al., 1998; Salamov et al., 1999). Sequence databases of homologous protein families and their associated multiple alignments have been established using these methods, e.g. PROSITE, PRINTS, PFAM (Bateman et al., 1999), SMART (Ponting et al., 1999) and PRODOM (Corpet et al., 1999).

More distant evolutionary relationships (<20% sequence identity) are difficult to elucidate without a combination of structural and functional evidence to prove homology (see Murzin, 1996, 1998 for reviews). SCOP uses the literature to manually identify unusual structural features (e.g. beta-bulges) and key conserved residues involved in structure stabilization, substrate or co-factor binding or catalysis (Murzin et al., 1995; Brenner et al., 1996). In the CATH database, high structural similarity indicated by the SSAP structure comparison algorithm is used to infer homologous proteins that are then validated by checking the literature and available functional information. While CATH and SCOP contain similar classifications for close homologues, the difficulty in identifying distant relationships means that they differ for the more distant homologues (Hadley and Jones, 1999). Other approaches for identifying homologous proteins are based on deriving core structures for protein families (Schmidt et al., 1997; Matsuo and Bryant, 1999; Orengo, 1999) or comparing functional descriptions, e.g. SWISS-PROT keywords (Holm and Sander, 1997).

Manual validation of very distant evolutionary homologues using functional data can be a very time-consuming step. However, because there is currently no consistent format for most functional data, a purely automated approach can sometimes fail to detect significant relationships and can lead to incorrect homologue assignment. The possibility of functional inheritance errors in sequence databases (Bork and Koonin, 1998) highlights the need for a similarly cautious approach to homologue classification for structural databases.

For this reason we have developed the CATH Dictionary of Homologous Superfamilies (DHS), a Web-based resource that provides multiple structural alignments for each superfamily containing more than one non-identical representative (362 superfamilies) and facilitates the recognition of distant homologues. Alignments and structural profiles for each superfamily have been generated using the CORA suite of programs (Orengo, 1999). The alignments permit the identification of consensus superfamily-specific properties (e.g. conserved protein–ligand interactions). These consensus features can be used to verify the results of sequence (e.g. PSI-BLAST) or structure (e.g. SSAP, CORA structural profiles) comparison methods that have inferred a distant homologous relationship to a protein in a CATH superfamily.

We illustrate the value of the DHS as a diagnostic tool using the pyridoxal-5'-phosphate (PLP) binding aspartate aminotransferase superfamily which contains diverse sequences and a broad range of functions. The CORA multiple alignments can easily be downloaded from the DHS Web pages and will be updated with each CATH database release.


    Materials and methods
 
Protein data set: overview of CATH families

The CATH database (release 1.5, October 1998) provides a hierarchical classification of 14 382 protein domain structures in the PDB (Orengo et al., 1997). The major classification levels are protein class (C), architecture (A), topology (T), homologous superfamily (H) and sequence family (S) (see Table I). To reduce the level of redundancy in the PDB, the proteins in CATH with >95% sequence identity are clustered into 2807 near-identical protein families (N-level). The CATH-95 dataset includes one representative structure from each N-level family and is used for all structural comparisons in the construction of the DHS. Sequence families contain close evolutionary relatives that are identified using sequence comparison methods (Needleman and Wunsch, 1970) and conservative sequence identity cut-offs (>=35%).


Table I. The number of branches at each hierarchical level in the CATH database (release 1.5)
 
There are 959 homologous superfamilies, in which more distantly related structures are grouped according to structural and functional similarities that suggest evolution from a common ancestor. PSI-BLAST (Altschul et al., 1997) is now used for detecting remote homologues in the CATH classification procedure using stringent E-value cut-offs (0.0005) and scanning against a translated non-redundant complexity masked GenBank database (Benson et al., 1999). More distant homologues are identified using the SSAP structure comparison algorithm (Taylor and Orengo, 1989; Orengo et al., 1992) and CORA structural profiles for each homologous superfamily (Orengo, 1999). The DHS presents those 362 superfamilies that contain more than one CATH-95 structure. Figure 1 summarizes the steps required to generate the DHS.



Fig. 1. Flowchart of the methods and data required for creating and maintaining the Dictionary of Homologous Superfamilies (DHS).

 
Generation of data for the DHS

Generation of structure comparison data using SSAP. In order to provide data on sequence and structure relationships within each fold group in CATH, SSAP structural comparisons were performed for each pair of CATH-95 domains within each fold (117 239 pairwise comparisons). Protein pairs within the same fold and same superfamily are defined as homologues whereas those within the same fold but in different superfamilies are analogues. Fold level comparisons provide a complete dataset for analysing analogues and homologues and checking for any incorrect classifications. SSAP returns a normalized score between 0 and 100 for each pairwise comparison. Scores above 70 accompanied by significant residue overlap (>=60%) indicate proteins with the same fold or topology (T). Higher scores (>80) suggest a homologous relationship between two proteins. Sequence identity in this study is defined as the number of identical residues (after structural alignment by SSAP) divided by the number of residues in the smaller of the two protein domains. SSAP score and sequence identity matrices are available for each superfamily in the DHS Web pages.
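The identity definition and score thresholds above can be illustrated with a minimal Python sketch. It is not the SSAP implementation: the function names, the gap character and the choice to require the >=60% overlap for the >80 score band as well are assumptions made here for illustration.

```python
def sequence_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity: identical aligned residues / length of the smaller domain."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count columns where both domains carry the same residue (gaps never match).
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    len_a = sum(1 for a in aligned_a if a != gap)
    len_b = sum(1 for b in aligned_b if b != gap)
    smaller = min(len_a, len_b)
    if smaller == 0:
        raise ValueError("each domain needs at least one residue")
    return 100.0 * identical / smaller


def interpret_ssap(score: float, overlap: float) -> str:
    """Apply the thresholds quoted in the text to one pairwise comparison."""
    if overlap < 60:          # assumption: overlap is required in every band
        return "no clear similarity"
    if score > 80:
        return "possible homologue"
    if score > 70:
        return "same fold"
    return "no clear similarity"


if __name__ == "__main__":
    print(sequence_identity("ACD-EF", "ACDKE-"))  # 4 identities over 5 residues
    print(interpret_ssap(76, 85))
```

A pair scoring 74 to 78 with good overlap therefore lands in the "same fold" band, which is where manual inspection of functional evidence becomes necessary.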

Automatic validation of structural relatives (DHS-VALID). The DHS-VALID program is used to check automatically all the pairwise sequence and structure comparison data generated for each fold group and homologous superfamily in CATH. Outlying proteins that have low structural similarity scores and low structural overlap percentages against the majority of the relatives are identified. These can then be checked against the known functional information and ligand interaction data and if necessary placed into a newly created homologous family. Similarly, high SSAP scores can identify potentially homologous proteins that are currently classified in different superfamilies.
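As a rough illustration of the kind of check described for DHS-VALID, the sketch below flags domains that score poorly against most of their relatives. The actual program is not described in detail here, so the function name, the input layout and both thresholds (70 for the SSAP score, 60% for overlap) are assumptions.

```python
from collections import defaultdict


def find_outliers(pairs, min_score=70.0, min_overlap=60.0):
    """Return domains with weak comparisons against the majority of relatives.

    pairs: iterable of (domain_a, domain_b, ssap_score, overlap_percent).
    A comparison is 'weak' when both score and overlap fall below the
    (assumed) thresholds.
    """
    weak_count = defaultdict(int)
    total_count = defaultdict(int)
    for dom_a, dom_b, score, overlap in pairs:
        weak = score < min_score and overlap < min_overlap
        for dom in (dom_a, dom_b):
            total_count[dom] += 1
            if weak:
                weak_count[dom] += 1
    # A domain is an outlier if more than half of its comparisons are weak.
    return sorted(d for d in total_count if weak_count[d] > total_count[d] / 2)


if __name__ == "__main__":
    # Hypothetical domain names and scores, for illustration only.
    example = [
        ("domA", "domB", 85, 90), ("domA", "domC", 82, 88),
        ("domB", "domC", 88, 92), ("domA", "domX", 60, 50),
        ("domB", "domX", 62, 45), ("domC", "domX", 58, 40),
    ]
    print(find_outliers(example))
```

Flagged domains would still need checking against functional and ligand data before being moved, as the text notes.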

Generation of multiple structural alignments using CORA. Multiple structural alignments were generated using the CORA program (Orengo, 1999) for each of the 362 homologous superfamilies with more than one CATH-95 structure. CORA (Conserved Residue Attributes) is a suite of programs for automatically multiply aligning and analysing protein structural families. CORA uses the pairwise structural comparison data from SSAP to determine the initial set of proteins to be aligned and then identifies conserved characteristics and expresses them as a 3D structural profile for each family. As the profiles encapsulate the critical 'core' of the fold and functional sites, which in the case of homologous proteins have been conserved throughout evolution, they are more sensitive at identifying distant structural homologues than comparing against a single structure.

Annotation of structural alignments. The residues in the multiple structural alignments are annotated by colour in several different ways using the program DHS-PLOT. At the simplest level, plots are colour coded according to secondary structure regions (as defined by DSSP; Kabsch and Sander, 1983), sequence identity and amino acid type (Taylor, 1997). A shaded score bar beneath each alignment indicates the CORA structural conservation score which measures the conservation of the structural environment (dark grey/black are highly conserved regions).

PROSITE sequence patterns (Hofmann et al., 1999; URL: http://www.expasy.ch/prosite) are also included in the multiple alignment data for each homologous superfamily. Only structurally significant PROSITE patterns (from release 1.4) are used, as identified for all known PDB structures (Kasuya and Thornton, 1999). Ligand interaction data is derived from the GROW algorithm (Milburn et al., 1998). Importantly, the structural alignment plots help to identify the consensus PROSITE patterns and ligand interaction positions that exist in the structurally conserved regions for a given protein superfamily.

PROSITE patterns, ligand interactions and domain boundary representations are also brought together in DOMPLOT diagrams (Todd et al., 1999a) which are available for each homologous superfamily. The 3D structural superpositions can be viewed interactively in a RASMOL viewer (Sayle and Milner-White, 1995) using a customized RASMOL script (ROMLAS; R.A.Laskowski, unpublished computer program) that colours the 3D structures to complement the shading of the 2D sequence plots.

Methods for the automatic extraction of functional information. Protein information as given in the PDB header, including the protein name, species and crystallographic or NMR information, was extracted directly from PDBsum web pages (Laskowski et al., 1997; URL: http://www.biochem.ucl.ac.uk/bsm/pdbsum). The files from the ENZYME database (Bairoch, 1999, release 23.0, July 1998) and the SWISS-PROT database (Bairoch and Apweiler, 1999, release 35.0) were downloaded from the ExPASy website (URL: http://www.expasy.ch). SWISS-PROT entries give links to structures in the PDB but do not specify individual PDB chains. A cross-reference table of PDB chain to SWISS-PROT entry was derived from searching PDB chain sequences against SWISS-PROT and using the highest identity matches from SWISS-PROT (A.C.R.Martin, personal communication). The EC numbers for PDB chains were automatically extracted from the SWISS-PROT files using this cross-reference table (Martin et al., 1998). The summary information from the ENZYME file (description and reaction entries) and the SWISS-PROT file (comments and keyword entries) was extracted using Perl scripts.
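The extraction step can be sketched as follows. The original work used Perl scripts that are not reproduced here; this Python sketch only assumes the standard SWISS-PROT flat-file layout (a two-letter line code followed by three spaces) and that EC numbers appear in the DE lines as "(EC n.n.n.n)". The sample entry is made up for illustration and is not copied from any database release.

```python
import re

# EC numbers such as "2.6.1.1"; a dash may stand for an unassigned level.
EC_PATTERN = re.compile(r"\(EC\s+(\d+(?:\.(?:\d+|-)){3})\)")

# Hypothetical entry written in SWISS-PROT flat-file style.
SAMPLE_ENTRY = """\
ID   EXAMPLE_ENTRY  STANDARD;      PRT;   411 AA.
DE   ASPARTATE AMINOTRANSFERASE, CYTOPLASMIC (EC 2.6.1.1)
DE   (TRANSAMINASE A).
KW   AMINOTRANSFERASE; TRANSFERASE; PYRIDOXAL PHOSPHATE;
KW   3D-STRUCTURE.
"""


def parse_swissprot_entry(text: str) -> dict:
    """Collect the description, EC numbers and keywords from one entry."""
    de_parts, kw_parts = [], []
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code == "DE":
            de_parts.append(content)
        elif code == "KW":
            kw_parts.append(content)
    description = " ".join(de_parts)
    # Keywords are separated by ';' and the list ends with a full stop.
    kw_text = " ".join(kw_parts).rstrip(".")
    keywords = [kw.strip() for kw in kw_text.split(";") if kw.strip()]
    return {
        "description": description,
        "ec_numbers": EC_PATTERN.findall(description),
        "keywords": keywords,
    }


if __name__ == "__main__":
    print(parse_swissprot_entry(SAMPLE_ENTRY))
```

In a full pipeline the parsed record would be joined to PDB chains through the chain-to-entry cross-reference table described above.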

Compiling and updating the DHS. DHS-WEB is a suite of programs and scripts that are used to combine the multiple alignments with the functional data, consensus sequence patterns and information about sequence/structure relationships. The output is a DHS Web page of functional summary tables in HTML for each homologous superfamily. The programs can be run automatically to regenerate the DHS web pages and links to other web sites for each new release of the CATH database as further relatives are added to the homologous superfamilies. The CATH classification data is stored in a Postgres relational database and the DHS data is in flat file format. CATH and the DHS are currently being transferred into an Oracle relational database that will be updated initially on a 6 monthly cycle and subsequently on a quarterly cycle. Once the Oracle database is set up, CATH and DHS mirror sites will be established in three locations (USA, Canada and Cambridge, UK). DHS alignment files will be made available from the CATH ftp site (URL: ftp://ftp.biochem.ucl.ac.uk/pub/cathdata).


    Results
 
The CATH Dictionary of Homologous Superfamilies (DHS)

The CATH DHS is a resource providing multiple structural alignments in the 362 well-populated CATH superfamilies and facilities for viewing them (URL http://www.biochem.ucl.ac.uk/bsm/dhs). These alignments establish which residues are structurally equivalent across a whole superfamily and, since they are annotated with consensus sequence patterns and functional information, the DHS provides a valuable diagnostic tool for detecting remote homologous relationships based on the identification of consensus superfamily specific features.

Conserved active site residues, ligand interactions or sequence motifs often provide the functional signature necessary to prove an evolutionary link. The DHS resource summarizes these properties for each superfamily on a single Web page. It is linked to the CATH structure classification server (URL: http://www.biochem.ucl.ac.uk/bsm/cath/server) allowing the biologist to scan the CATH database with a newly determined structure and then access the DHS pages for each match to assess the potential evolutionary relationship.

Using the DHS as a diagnostic tool for identifying remote homologues in the PLP-dependent aspartate aminotransferase superfamily (large domain)

Pyridoxal-5'-phosphate (PLP, a vitamin B6 derivative) is a versatile cofactor, able to catalyse many different reactions involved in nitrogen metabolism in all organisms (Jansonius, 1998). The PLP-dependent aspartate aminotransferase superfamily is presented here to demonstrate the use of the DHS as a resource for homologue classification within the CATH database. Recent studies of PLP-dependent enzymes have shown that there are five distinct PLP-binding domain folds (Denessiouk et al., 1999). All the enzymes have PLP bound to an active site lysine, forming an internal aldimine. Once the amino acid substrate reacts with the cofactor, any one of three remaining bonds around the Cα may be cleaved, enabling a broad range of reactions including transamination, racemization and decarboxylation. Reaction specificity is due to interactions with the groups surrounding the Cα of the substrate that favour a particular bond cleavage (Martell, 1982).

In the PLP-dependent aspartate aminotransferase superfamily, all the enzymes have two distinct domains that have different topologies (Figure 2A) and so are classified in different superfamilies in the CATH domain database. The PLP binds covalently to a conserved lysine residue (in the large domain) at the bottom of the interdomain active site cleft (Figure 2B). The large domain has three-layer sandwich architecture with a seven-stranded mixed β-sheet forming the domain core that is surrounded by helices on both sides (CATH code 3.40.640). The central β-sheet is mostly parallel with one anti-parallel edge strand. This is reminiscent of the Rossmann fold which has the same three-layer sandwich architecture but has a six-stranded parallel β-sheet at the domain core. Figure 2C shows the TOPS diagram (Westhead et al., 1998) to represent schematically the topology of the large domain. The small domain has an α–β plait topology and can be found in DHS Web pages for CATH code 3.30.70.160.





Fig. 2. Structural features of aspartate aminotransferase (PDB entry 2cst). (A) Cartoon representation of the large (light grey) and small (dark grey) domains of the enzyme complexed with pyridoxal-5'-phosphate (PLP, space-fill representation). The figure was generated using MOLSCRIPT (Kraulis, 1991) and RASTER3D (Merritt and Bacon, 1997). (B) The LIGPLOT diagram (Wallace et al., 1995) shows the interactions between the aspartate aminotransferase and the PLP co-factor. PLP is covalently bound to lysine 258 in the large domain. Hydrogen bonds are indicated with dotted lines and hydrophobic contacts are shown with spoked arcs. (C) Schematic TOPS diagram (Westhead et al., 1998) for the large domain (CATH entry 2cstA2). Strands are shown as triangles and helices as circles (3₁₀ helices are not shown). Lines drawn over a symbol are connections to the top of a secondary structure, otherwise connection is to the base.

 
The multiple structural alignments and functional tables shown here are for the large domain superfamily (CATH code 3.40.640.10) that contains four sequence families and eight representative domains (Table II). There is <11% sequence identity between relatives of these different sequence families (Figure 3). Pairwise SSAP scores of 74–78 indicate that the proteins have similar folds but are in the twilight zone of structural similarity regarding their evolutionary relationship. These families would not be merged automatically to form a homologous superfamily as the sequence identity and structural similarity are not sufficiently high and a number of different functions are observed across the set. Instead, the DHS functional tables and annotated structural alignments can be consulted to search for any possible evolutionary link.


Table II. The four sequence families in the PLP-dependent aspartate aminotransferase homologous superfamily (CATH code 3.40.640.10)
 


Fig. 3. Sequence identity and SSAP structural comparison relationships between representatives from the sequence families in the aspartate aminotransferase homologous superfamily (2cstA2, 1ax4A2, 2dkb02, 1ordA2).

 
Figure 4A shows the DHS summary table containing the protein information as given in the PDB header which includes the protein name, species and crystallographic or NMR information. There are also links from the DHS table to PDBsum pages, individual MOLSCRIPT pictures (Kraulis, 1991) and DOMPLOT diagrams. Enzyme Classification (EC) numbers [Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC–IUBMB), 1992] are also listed (Figure 4B) and it can be seen that these are spread over two primary classes, transferases (class 2) and lyases (class 4), indicating the functional diversity in this superfamily. A recent analysis has shown that this type of diversity is observed in 9% (17) of the 190 enzyme homologous superfamilies in CATH with multiple members in the PDB. Almost half (91/190) of the superfamilies have different EC numbers at some level in the EC hierarchy (Todd et al., 1999b). However, it can be seen from the EC table that even though the EC numbers are different, the enzyme reactions all contain an L-amino acid (or similar) as either the substrate or the product. The links go to the ENZYME Web site where more detailed information can be found (see Materials and methods section for all Web site locations).
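The notion of EC numbers differing "at some level in the EC hierarchy" can be made concrete with a short Python sketch. The function name is hypothetical, and the EC numbers in the usage example are illustrative inputs (a transferase and a lyase) rather than values read from the DHS table.

```python
def ec_divergence_level(ec_numbers):
    """Return the first EC level (1 to 4) at which the numbers differ.

    Returns None when all EC numbers are identical. Level 1 is the primary
    class (e.g. 2 = transferases, 4 = lyases).
    """
    fields = [ec.split(".") for ec in ec_numbers]
    if any(len(parts) != 4 for parts in fields):
        raise ValueError("each EC number needs four dot-separated fields")
    for level in range(4):
        if len({parts[level] for parts in fields}) > 1:
            return level + 1
    return None


if __name__ == "__main__":
    # Illustrative inputs: one class 2 enzyme and one class 4 enzyme.
    print(ec_divergence_level(["2.6.1.1", "4.1.1.17"]))
```

A superfamily whose members diverge at level 1, as in this superfamily, spans more than one primary enzyme class, which is the 17-of-190 case quoted above.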





Fig. 4. Screen snapshots from the DHS Web page for the aspartate aminotransferase superfamily. Information for the eight representative domains is shown in tables for (A) protein descriptions, (B) enzyme classification numbers, (C) PROSITE patterns, (D) SWISS-PROT descriptions and (E) ligand data.

 
Another DHS table gives information on sequence patterns extracted from PROSITE. Figure 4C shows that each sequence family has a different PROSITE pattern and name; however, the multiple structural alignment shows that the patterns overlap around the totally conserved lysine residue that binds the PLP (Figure 5A). Further inspection of this consensus feature reveals that the different patterns are all PLP binding motifs, whilst the SWISS-PROT summary table (Figure 4D) allows the user to identify the pyridoxal phosphate (PLP) keywords in all the protein entries. PLP is also listed in the ligand information table (Figure 4E). As the summary tables have links to the original databases, more detailed information can be quickly accessed.





Fig. 5. CORA multiple structural alignment for the aspartate aminotransferase superfamily. (A) The sequence alignment for the region surrounding the totally conserved lysine residue (alignment position 308) that binds PLP is shown in three different colour schemes, secondary structure, PROSITE patterns and ligand interactions. (B) The DOMPLOT diagram (Todd et al., 1999a) for the sequence family representatives in this superfamily (2cstA2, 1ax4A2, 2dkb02, 1ordA2). Coloured vertical lines within the boxes highlight residues involved in ligand interactions, coloured horizontal lines below the boxes denote PROSITE motifs. Consensus ligand interactions are shown as small black lines at the bottom of the plot. Specific residue information can be obtained by cross-reference with the DHS ligand interaction table (not shown). (C) The 3D structures can be superposed and viewed in RASMOL (Sayle and Milner-White, 1995) as seen here for the sequence family representatives. The coloured regions correspond to the ligand interacting residues.

 
DOMPLOT diagrams of the structural alignment are also used to visualize protein–ligand interactions and show that there is a high degree of conservation in the positioning of the PLP contacting residues (Figure 5B). There are four conserved ligand-interacting regions for the four sequence family representatives (2cstA2, 1ax4A2, 2dkb02, 1ordA2) that all bind PLP (or similar). Ligand binding positions can also be viewed on the multiple superposition (Figure 5C). Considering that these proteins all belong to the same fold group, the PLP binding data, consensus PROSITE motifs and L-amino acid substrates provide the explicit functional evidence to support the structural data that these are evolutionary relatives and should therefore be classified in the same CATH homologous superfamily.
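One simple way to locate consensus ligand-interacting positions in a multiple alignment is to map each domain's contacting residues onto alignment columns and keep the columns shared by all (or most) domains. The Python sketch below illustrates that idea only; it is not the DOMPLOT or DHS-PLOT code, and the function names, input layout and toy alignment are assumptions.

```python
from collections import Counter


def residue_to_column(aligned, gap="-"):
    """List the alignment column of each residue (0-based, gaps skipped)."""
    return [col for col, ch in enumerate(aligned) if ch != gap]


def consensus_contact_columns(alignment, contacts, min_fraction=1.0):
    """Columns where at least min_fraction of the domains contact the ligand.

    alignment: {domain_name: gapped sequence}, all of equal length.
    contacts:  {domain_name: iterable of 0-based residue indices}.
    """
    counts = Counter()
    for name, aligned in alignment.items():
        columns = residue_to_column(aligned)
        # Use a set so each domain contributes at most once per column.
        for col in {columns[i] for i in contacts.get(name, ())}:
            counts[col] += 1
    needed = min_fraction * len(alignment)
    return sorted(col for col, n in counts.items() if n >= needed)


if __name__ == "__main__":
    # Toy alignment with hypothetical domain names and contact residues.
    alignment = {"domA": "AK-GS", "domB": "-KTGS", "domC": "AKTG-"}
    contacts = {"domA": [1, 2], "domB": [0, 2], "domC": [1]}
    print(consensus_contact_columns(alignment, contacts))
    print(consensus_contact_columns(alignment, contacts, min_fraction=0.6))
```

Relaxing `min_fraction` below 1.0 tolerates domains solved without bound ligand, at the cost of weaker evidence for each consensus position.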

Using the DHS data to analyse sequence and structure relationships in CATH

DHS data on pairwise sequence and structural similarity within each fold can be used to analyse structural relationships within different CATH fold groups and superfamilies. For example, Figure 6 shows a sequence–structure plot for the PLP aspartate aminotransferase family, illustrating that the current CATH family contains diverse relatives (<15% sequence identity) as well as close homologues with high structural similarity (SSAP score >=80) and significant sequence identity (>=25%). The DHS contains a sequence–structure plot for each homologous superfamily.



Fig. 6. Sequence–structure plot for proteins within the aspartate aminotransferase superfamily. Pairwise comparisons are shown as black triangles. The majority of proteins in this family are distant evolutionary relatives and have sequence identities below 15%.

 
Currently, >30% of homologous superfamilies in CATH belong to superfold groups (Orengo et al., 1994). Amongst the most highly populated of these folds are the globins, immunoglobulins, TIM barrels and the Rossmann fold that contain between four and 72 different homologous superfamilies (see Table III). The pairwise comparison data in the DHS has been used to explore the structural relationships and sequence identity variation (based on structural alignment) within these superfold groups. Interestingly, we find that sequence identity distributions for homologous proteins show a dependence on the type of fold group studied (Figure 7A–D) whilst the distribution of analogous sequence identities is found to peak between 6 and 8% independent of the fold group (Figure 7E–H). This is in broad agreement with a previous analysis by Russell et al. (1997).
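The split into homologous and analogous distributions follows directly from the CATH codes of each pair: the same C.A.T numbers mean the same fold, and an identical H number as well means the same superfamily. A minimal Python sketch of that labelling and of the binned identity distributions is given below; the function names, the 2% bin width and the example pairs are assumptions for illustration.

```python
from collections import Counter


def relationship(code_a: str, code_b: str) -> str:
    """Label a pair from its CATH codes (C.A.T.H, dot separated)."""
    a, b = code_a.split("."), code_b.split(".")
    if a[:3] != b[:3]:
        return "different fold"
    return "homologue" if a[:4] == b[:4] else "analogue"


def identity_histogram(pairs, bin_width=2):
    """Bin percent identities separately for homologues and analogues.

    pairs: iterable of (cath_code_a, cath_code_b, percent_identity).
    Keys of each Counter are the lower edges of the bins.
    """
    hist = {"homologue": Counter(), "analogue": Counter()}
    for code_a, code_b, identity in pairs:
        label = relationship(code_a, code_b)
        if label in hist:  # pairs from different folds are ignored
            hist[label][int(identity // bin_width) * bin_width] += 1
    return hist


if __name__ == "__main__":
    # Made-up pairs; the codes are used only to show the labelling.
    pairs = [
        ("3.40.50.720", "3.40.50.300", 7.2),
        ("3.40.50.720", "3.40.50.300", 6.0),
        ("3.40.640.10", "3.40.640.10", 11.0),
    ]
    print(identity_histogram(pairs))
```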


Table III. The number of homologous superfamilies, sequence families and representative protein domains (95% sequence identity) for four of the most populated fold groups in the CATH database
 


Fig. 7. Plots to show the sequence identity distributions for pairs of homologous and analogous proteins in the globin fold (A, E), TIM barrel fold (B, F), Rossmann fold (C, G) and the immunoglobulin fold (D, H). The plots for the analogous proteins are underneath the respective homologous distribution.

 
In the globin superfold (Figure 7A) there is a broad homologue distribution between 8 and 30% sequence identity that is distinct from the analogue distribution found at lower identities. This superfold contains only four homologous superfamilies, the largest of which are the oxygen-binding globins (43 out of 54 CATH-95 representatives) that have a high degree of structural conservation. The relatively high sequence identity distribution in this superfold may be caused by the structural constraints of providing the correct environment for the large haem group to function specifically as a reversible oxygen-binding co-factor.

By contrast, in the homologue distributions for both TIM barrel folds (Figure 7B) and Rossmann folds (Figure 7C) there are large peaks at 6% sequence identity, much lower than for the globin fold. These homologous distributions have considerable overlap with their respective analogous distributions. The shift to lower sequence identities relative to the globin distributions may be due to the considerable functional diversity of TIM barrel and Rossmann fold proteins. If just the single domain proteins are considered, there are 46 enzyme functions in the TIM barrel fold and 37 different functions in the Rossmann fold (A.E.Todd, unpublished data). This functional diversity may be the consequence of a stable structural framework that is tolerant to extensive changes in the sequence so that there remains very little similarity in the sequences and no common sequence motifs between proteins having different functions. In these folds, analogous pairs may be very distantly related homologues. However, this is very difficult to prove without the presence of intermediate sequences or further evolutionary evidence such as proteins being part of a common metabolic pathway.

The homologous distribution of the immunoglobulin superfold unusually exhibits two large peaks below 30% sequence identity (Figure 7D). There are 30 homologous superfamilies in the immunoglobulin fold with a total of 320 CATH-95 representative domains. The fold population is strongly biased towards the largest superfamily (CATH code 2.60.40.10), which contains 220 representatives and includes the antibody protein domains. One possible explanation for the peaks is that the homologous domains of the antibody proteins have arisen through gene duplication but have diverged to perform different functions within the antibody molecule. The variable domains from the light and heavy chains (VL and VH) are required for binding to the antigen while the constant domains (CL and CH) have a structural role. The peak around 23% arises from identities between two functionally similar domains in the same antibody molecule (i.e. VL–VH or CL–CH). The lower peak around 10% sequence identity is caused by the identities between dissimilar domains (i.e. VL–CL, VH–CH, VL–CH and VH–CL). The peaks above 30% are due to comparing the same domains (e.g. VL–VL or VH–VH) from different antibody molecules in the PDB.

These observations suggest that once functional constraints have been removed, sequences can diverge much further while retaining the same fold. As the function diverges within a homologous family (e.g. in paralogous proteins) the degree of sequence and structural similarity resembles that of analogous proteins and it becomes difficult to distinguish between analogous proteins and very distant homologues.
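As an illustrative sketch of how distributions such as those in Figure 7 can be tabulated, the Python fragment below bins pairwise sequence identities separately for homologous pairs (same superfamily) and analogous pairs (same fold, different superfamily). The input records, the 2% bin width, the function name and the second superfamily code are assumptions made for illustration; they are not taken from the DHS software.

```python
from collections import Counter

def identity_histograms(pairs, bin_width=2):
    """Bin pairwise sequence identities (%) for homologous and analogous pairs.

    pairs: iterable of (superfamily_a, superfamily_b, percent_identity)
    for domains that already share the same fold.
    Returns two dicts mapping the lower edge of each bin to a pair count.
    """
    homologous, analogous = Counter(), Counter()
    for sf_a, sf_b, identity in pairs:
        bin_start = int(identity // bin_width) * bin_width
        # Same superfamily -> homologous; different superfamily -> analogous
        target = homologous if sf_a == sf_b else analogous
        target[bin_start] += 1
    return dict(homologous), dict(analogous)

# Hypothetical pairs within one fold (values invented for illustration)
example_pairs = [
    ("2.60.40.10", "2.60.40.10", 23.4),
    ("2.60.40.10", "2.60.40.10", 10.1),
    ("2.60.40.10", "2.60.40.10", 22.0),
    ("2.60.40.10", "2.60.40.30", 7.9),
]
homologous_hist, analogous_hist = identity_histograms(example_pairs)
```

Plotting the two dictionaries one above the other would give the layout used for each fold in Figure 7.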

Considering also structural similarity, we observe that in general analogous protein pairs often have lower SSAP scores than homologous pairs and <25% sequence identity (see, for example, Figure 7D for the immunoglobulin fold). A full analysis of all the 117 239 pairwise relationships in the CATH-95 dataset showed that no analogues exhibit the combined criteria of SSAP >=80, sequence identity >=25% and residue overlap >=70%. As manual validation of homologues can be slow, even with the benefit of the DHS, these empirical cut-offs can be used to assign homologous proteins automatically to CATH superfamilies. In a recent dataset of 2646 new domain structures (1879 chains), 64% could be classified as homologues using cautious sequence identity cut-offs (>=35%) and a further 9.8% could be assigned using stringent E-value cut-offs in PSI-BLAST (see Materials and methods). The combined sequence–structure criteria identified a further 2.9% of homologous pairs. Of the remaining domains, 4.6% were found to be distant homologues using the DHS Web pages and assigned to existing CATH superfamilies, 6.9% were assigned to new superfamilies within an existing fold group as they lacked sufficient evidence to prove homology and 11.8% were identified as novel folds (see Figure 8). These automatic homologue assignment criteria in combination with the DHS should enable the CATH classification to keep pace with the rapid increase in structure determination expected from structural genomic initiatives.
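The order in which these automatic criteria are applied can be sketched in Python as follows. This is a minimal illustration, not the CATH update code: the function name and argument layout are hypothetical, and the PSI-BLAST E-value cut-off is a placeholder because the value used is given only in Materials and methods. The sequence identity, SSAP and overlap thresholds are those quoted above.

```python
def assign_homologue(seq_identity, psiblast_evalue, ssap_score, overlap,
                     evalue_cutoff=1e-3):
    """Return the first automatic criterion that supports homology, else None.

    seq_identity and overlap are percentages; ssap_score is on a 0-100 scale.
    evalue_cutoff is a placeholder, not the value used in the paper.
    """
    if seq_identity >= 35:
        return "sequence identity"
    if psiblast_evalue is not None and psiblast_evalue <= evalue_cutoff:
        return "PSI-BLAST"
    if ssap_score >= 80 and seq_identity >= 25 and overlap >= 70:
        return "sequence-structure"
    return None  # left for manual inspection with the DHS

# Percentages of the 2646 new domains reported in the text
reported = {
    "sequence identity": 64.0,
    "PSI-BLAST": 9.8,
    "sequence-structure": 2.9,
    "DHS distant homologues": 4.6,
    "new superfamilies": 6.9,
    "novel folds": 11.8,
}
automatic_keys = ("sequence identity", "PSI-BLAST", "sequence-structure")
automatic_percent = round(sum(reported[k] for k in automatic_keys), 1)
total_percent = round(sum(reported.values()), 1)
```

With the figures reported above, the three automatic steps together account for 76.7% of the new domains, and the six categories sum to 100%.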



Fig. 8. Pie chart to show the result of assigning 2646 recently deposited protein domains to homologous superfamilies in the CATH database using sequence comparison methods, combined sequence–structure criteria and the DHS.

 

    Discussion
 
The Dictionary of Homologous Superfamilies (DHS) provides a compendium of structural and functional information focused at the level of evolutionary relationships. It has emerged from the growing need to validate rigorously the existing CATH hierarchical classification by using functional information in addition to the existing sequence and structure comparison tools. The DHS resource provides multiple structural alignments for each superfamily together with derived consensus functional information, e.g. sequence motifs and ligand interactions. The alignments can be downloaded from the Web and will be updated with each release of the CATH database. Automatic and manual checking using the DHS functional tables and alignments has ensured that CATH superfamilies are more biologically coherent. This is increasingly important as structure databases are used as a resource for training and benchmarking fold recognition algorithms (e.g. GenTHREADER; Jones, 1999), testing sequence comparison methods (Brenner et al., 1998; Park et al., 1998; Salamov et al., 1999) and genome analysis (Salamov et al., 1998).

One of the main purposes of studying the 3D structure of a protein is to gain clearer insights into its specific function and biological role. However, for many protein families there are not yet any structural data, e.g. only ~1000 structural families are known compared with ~20 000 sequence families. For these cases, information on biological role can sometimes be gleaned by considering the functions of related protein sequences. This technique of functional inheritance is now routinely applied when analysing and annotating novel genome sequences. Structural genomic initiatives have now been established (Pennisi, 1998) to determine structural representatives for each of the 20 000 sequence families, making the prospect of functional inheritance through structural similarity increasingly feasible. This highlights the growing importance of incorporating functional annotations for structures and related genomic sequences into the DHS.

The CATH Protein Family Database (CATH-PFDB) of genomic sequences has recently been established to complement the CATH structural database (Pearl et al., 2000). The CATH-PFDB contains over 100 000 clear homologues from the sequence databases (e.g. translated GenBank) that have been identified using PSI-BLAST as closely related to structural domains in CATH. These sequences and their associated functional annotations (e.g. from SWISS-PROT) will be incorporated into the DHS to increase significantly the functional information available for each structural superfamily. A PSI-BLAST server (URL: http://www.biochem.ucl.ac.uk/bsm/PSI-CATH) has been developed to allow the user to scan the CATH database with a new protein sequence. Hits to CATH structures will be linked to the DHS to allow consideration of functional variation within the potential superfamily. This should help to improve the accuracy of functional inheritance for sequences assigned to CATH superfamilies, particularly where functional diversity is greater and properties should therefore be inherited more cautiously.


    Acknowledgments
 
We acknowledge the support of the BBSRC and MRC. James Bray is in receipt of a BBSRC special studentship. Annabel Todd is supported by a BBSRC Case studentship in collaboration with Oxford Molecular Limited. Frances Pearl is supported by a BBSRC grant. Christine Orengo is supported by the Medical Research Council. We acknowledge support from the BBSRC for computing facilities. We thank Dr Roman Laskowski and Dr Andrew Martin for the use of their computer programs, Ian Sillitoe and William Valdar for helpful discussions and David Lee for the PSI-BLAST server.


    Notes
 
2 To whom correspondence should be addressed. E-mail: james@biochem.ucl.ac.uk


    References
 
Abola,E.E., Bernstein,F.C., Bryant,S.H., Koetzle,T.F. and Weng,J. (1987) In Allen,F.H., Bergerhoff,G. and Sievers,R. (eds), Crystallographic Databases: Information Content, Software Systems, Scientific Applications. Commission of the International Union of Crystallography, Bonn, pp. 107–132.

Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.

Attwood,T.K., Flower,D.R., Lewis,A.P., Mabey,J.E., Morgan,S.R., Scordis,P., Selley,J.N. and Wright,W. (1999) Nucleic Acids Res., 27, 220–225.

Bairoch,A. (1999) Nucleic Acids Res., 27, 310–311.

Bairoch,A. and Apweiler,R. (1999) Nucleic Acids Res., 27, 49–54.

Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Finn,R.D. and Sonnhammer,E.L.L. (1999) Nucleic Acids Res., 27, 260–262.

Benson,D.A., Boguski,M.S., Lipman,D.J., Ostell,J., Ouellette,B.F.F., Rapp,B.A. and Wheeler,D.L. (1999) Nucleic Acids Res., 27, 12–17.

Bork,P. and Koonin,E.V. (1998) Nature Genet., 18, 313–318.

Brenner,S.E., Chothia,C., Hubbard,T.J.P. and Murzin,A.G. (1996) Methods Enzymol., 266, 635–643.

Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1997) Curr. Opin. Struct. Biol., 7, 369–376.

Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1998) Proc. Natl Acad. Sci. USA, 95, 6073–6078.

Brown,N.P., Orengo,C.A. and Taylor,W.R. (1996) Comput. Chem., 20, 359–380.

Chothia,C. (1984) Annu. Rev. Biochem., 53, 537–572.

Chothia,C. and Lesk,A.M. (1986) EMBO J., 5, 823–826.

Corpet,F., Gouzy,J. and Kahn,D. (1999) Nucleic Acids Res., 27, 263–267.

Denessiouk,K.A., Denesyuk,A.I., Lehtonen,J.V., Korpela,T. and Johnson,M.S. (1999) Proteins: Struct. Funct. Genet., 35, 250–261.

Gibrat,J.F., Madej,T., Spouge,J.L. and Bryant,S.H. (1997) Biophys. J., 72, 298.

Hadley,C. and Jones,D.T. (1999) Structure, 7, 1099–1112.

Hofmann,K., Bucher,P., Falquet,L. and Bairoch,A. (1999) Nucleic Acids Res., 27, 215–219.

Holm,L. and Sander,C. (1993) J. Mol. Biol., 233, 123–138.

Holm,L. and Sander,C. (1997) In Gaasterland,T., Karp,P., Karplus,K., Ouzounis,C., Sander,C. and Valencia,A. (eds), Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 140–146.

Holm,L. and Sander,C. (1999) Nucleic Acids Res., 27, 244–247.

Hogue,C.W.V., Ohkawa,H. and Bryant,S.H. (1996) Trends Biochem. Sci., 21, 226–229.

Hubbard,T.J.P. and Blundell,T.L. (1987) Protein Engng, 1, 159–171.

Hubbard,T.J.P., Ailey,B., Brenner,S.E., Murzin,A.G. and Chothia,C. (1999) Nucleic Acids Res., 27, 254–256.

Hughey,R. and Krogh,A. (1996) Comput. Appl. Biosci., 12, 95–107.

Jansonius,J.N. (1998) Curr. Opin. Struct. Biol., 8, 759–769.

Jones,D.T. (1999) J. Mol. Biol., 287, 797–815.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Kasuya,A. and Thornton,J.M. (1999) J. Mol. Biol., 286, 1673–1691.

Kraulis,P.J. (1991) J. Appl. Crystallogr., 24, 946–950.

Laskowski,R.A., Hutchinson,E.G., Michie,A.D., Wallace,A.C., Jones,M.L. and Thornton,J.M. (1997) Trends Biochem. Sci., 22, 488–490.

Marchler-Bauer,A., Addess,K.J., Chappey,C., Geer,L., Madej,T., Matsuo,Y., Wang,Y. and Bryant,S. (1999) Nucleic Acids Res., 27, 240–243.

Martell,A.E. (1982) Adv. Enzymol. Relat. Areas Mol. Biol., 53, 163–199.

Martin,A.C., Orengo,C.A., Hutchinson,E.G., Jones,S., Karmirantzou,M., Laskowski,R.A., Mitchell,J.B., Taroni,C. and Thornton,J.M. (1998) Structure, 6, 875–884.

Matsuo,Y. and Bryant,S.H. (1999) Proteins: Struct. Funct. Genet., 35, 70–79.

Merritt,E.A. and Bacon,D.J. (1997) Methods Enzymol., 277, 505–524.

Milburn,D., Laskowski,R.A. and Thornton,J.M. (1998) Protein Engng, 11, 855–859.

Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998a) Protein Sci., 7, 2469–2471.

Mizuguchi,K., Deane,C.M., Blundell,T.L., Johnson,M.S. and Overington,J.P. (1998b) Bioinformatics, 14, 617–623.

Murzin,A.G. (1996) Curr. Opin. Struct. Biol., 6, 386–394.

Murzin,A.G. (1998) Curr. Opin. Struct. Biol., 8, 380–387.

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.

Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48, 443–453.

Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) (1992) Enzyme Nomenclature. Academic Press, New York.

Orengo,C.A. (1994) Curr. Opin. Struct. Biol., 4, 429–440.

Orengo,C.A. (1999) Protein Sci., 8, 699–715.

Orengo,C.A., Brown,N.P. and Taylor,W.R. (1992) Proteins, 14, 139–167.

Orengo,C.A., Jones,D.T. and Thornton,J.M. (1994) Nature, 372, 631–634.

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.

Orengo,C.A., Pearl,F.M.G., Bray,J.E., Todd,A.E., Martin,A.C., Lo Conte,L. and Thornton,J.M. (1999) Nucleic Acids Res., 27, 275–279.

Overington,J.P., Johnson,M.S., Sali,A. and Blundell,T.L. (1990) Proc. R. Soc. London, Ser. B, 241, 132–145.

Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) J. Mol. Biol., 273, 349–354.

Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) J. Mol. Biol., 284, 1201–1210.

Pearl,F.M.G., Lee,D., Bray,J.E., Sillitoe,I., Todd,A.E., Harrison,A.P., Thornton,J.M. and Orengo,C.A. (2000) Nucleic Acids Res., 28, 277–282.

Pennisi,E. (1998) Science, 279, 978–979.

Ponting,C.P., Schultz,J., Milpetz,F. and Bork,P. (1999) Nucleic Acids Res., 27, 229–232.

Rost,B. (1999) Protein Engng, 12, 85–94.

Russell,R.B. and Barton,G.J. (1992) Proteins: Struct. Funct. Genet., 20, 309–323.

Russell,R.B., Saqi,M.A.S., Sayle,R.A., Bates,P.A. and Sternberg,M.J.E. (1997) J. Mol. Biol., 269, 423–439.

Salamov,A.A., Suwa,M., Orengo,C.A. and Swindells,M.B. (1998) Protein Sci., 8, 771–777.

Salamov,A.A., Suwa,M., Orengo,C.A. and Swindells,M.B. (1999) Protein Engng, 12, 95–100.

Salem,G.M., Hutchinson,E.G., Orengo,C.A. and Thornton,J.M. (1999) J. Mol. Biol., 287, 969–981.

Sander,C. and Schneider,R. (1991) Proteins: Struct. Funct. Genet., 9, 56–68.

Sayle,R.A. and Milner-White,E.J. (1995) Trends Biochem. Sci., 20, 374–376.

Schmidt,R., Gerstein,M. and Altman,R.B. (1997) Protein Sci., 6, 246–248.

Siddiqui,A.S. and Barton,G.J. (1997) http://circinus.ebi.ac.uk:8080/3Dee/help/help_intro.html.

Sowdhamini,R. and Blundell,T.L. (1995) Protein Sci., 4, 506–520.

Sowdhamini,R., Rufino,S.D. and Blundell,T.L. (1996) Folding Des., 1, 209–220.

Sowdhamini,R., Burke,D.F., Huang,F., Mizuguchi,K., Nagarajaram,H.A., Srinivasan,N., Steward,R.E. and Blundell,T.L. (1998) Structure, 6, 1087–1094.

Taylor,W.R. (1997) Protein Engng, 10, 743–746.

Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 1–22.

Todd,A.E., Orengo,C.A. and Thornton,J.M. (1999a) Protein Engng, 12, 375–379.

Todd,A.E., Orengo,C.A. and Thornton,J.M. (1999b) Curr. Opin. Chem. Biol., 3, 548–556.

Wallace,A.C., Laskowski,R.A. and Thornton,J.M. (1995) Protein Engng, 8, 127–134.

Westhead,D.R., Hatton,D.C. and Thornton,J.M. (1998) Trends Biochem. Sci., 23, 35–36.

Received August 8, 1999; revised December 14, 1999; accepted January 6, 2000.