DOMPLOT: a program to generate schematic diagrams of the structural domain organization within proteins, annotated by ligand contacts

A.E. Todd1, C.A. Orengo1 and J.M. Thornton1,2,3

1 Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT and 2 Department of Biochemistry and Molecular Biology, Birkbeck College, Malet Street, London WC1E 7HX, UK


    Abstract
 
A program is described for automatically generating schematic linear representations of protein chains in terms of their structural domains. The program requires the co-ordinates of the chain, the domain assignment, PROSITE information and a file listing all intermolecular interactions in the protein structure. The output is a PostScript file in which each protein is represented by a set of linked boxes, each box corresponding to all or part of a structural domain. PROSITE motifs and residues involved in ligand interactions are highlighted. The diagrams allow immediate visualization of the domain arrangement within a protein chain, and by providing information on sequence motifs, and metal ion, ligand and DNA binding at the domain level, the program facilitates detection of remote evolutionary relationships between proteins.

Keywords: protein domains/protein–ligand contacts/protein structure/schematic diagrams/sequence motifs


    Introduction
 
Many polypeptide chains fold into two or more distinct structural regions, or domains. Although there is no absolute definition of a structural domain, domains are commonly regarded as local, compact, semi-independent units (Richardson, 1981). Recognition of the modular composition of a protein is essential, owing to the role of domains in protein structure, function and evolution.

Domains can be identified by visual inspection of the protein structure. However, with the rapid growth of the protein structure database, the need for automatic assignments of domains has become increasingly important to allow the efficient maintenance of structural classifications such as CATH (Orengo et al., 1997). Several algorithms for the automatic assignment of domains from the co-ordinates have been devised (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Swindells, 1995). These algorithms differ in the criteria they use to define a structural domain.

Tools to visualize the modular arrangement of protein sequences, in which domains are identified by sequence comparison, are now available (Gouzy et al., 1997; Schultz et al., 1998). It is important to note that their definition of a ‘domain’, as a sequence motif, is different from that used here, although a sequence motif may encompass an entire structural domain.

Here we describe a program, DOMPLOT, which produces schematic descriptions of protein chains in terms of their constituent structural domains. Unlike the sequence-based tools above, these diagrams are based on three-dimensional structural information and incorporate annotations derived from the co-ordinates. The diagrams take the form of a series of linked boxes, each box corresponding to all or part of a domain. The program is completely general in that it can work for any protein chain provided that the co-ordinates are in Protein Data Bank (PDB) format (Bernstein et al., 1977) and that the domains have previously been assigned. The output is in PostScript format (Adobe Systems Inc., 1985).

The diagrams illustrate the pattern of interactions between each domain and any metal ions, ligands and/or nucleic acids that it binds, allowing rapid analysis of the location of specific intermolecular interactions with respect to the domain sequence. PROSITE (Bairoch et al., 1997) information is also included.

The program has found a number of applications. The output gives a simple representation of the results of more complex atomic comparisons at the domain level, facilitating the comparison of many structures. It has been adapted to read multiple structural alignment files generated by CORA (Orengo, 1998). The output provides a concise summary of regions of structural equivalence between two or more protein chains. Conservation of residue interactions is immediately apparent, making it a useful tool to aid detection of protein homology. It can also be used for the comparison and verification of domain assignments.


    Materials and methods
 
Domain assignment

The domain assignments used here are those derived with a consensus approach (Jones et al., 1998), which was developed for the structural database CATH, although in practice any domain definitions can be used. The consensus approach applies three independent algorithms, PUU (Holm and Sander, 1994), DETECTIVE (Swindells, 1995) and DOMAK (Siddiqui and Barton, 1995). When the three algorithms are in agreement, the domain boundaries are assigned automatically; otherwise the protein structure is inspected by eye.

Domain sizes

Almost one third of structural domains are discontinuous in that they are constructed from two or more non-sequential segments of the polypeptide chain (Jones et al., 1998). DOMPLOT must be able to deal with such domains, and it is necessary to determine the number of residues within each sequence segment. The size of each segment within each domain is defined as the total number of residues within the sub-sequence, rather than the number of residues of known structure within the domain segment.
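The segment-size rule can be illustrated with a minimal Python sketch. The names `Segment`, `segment_size` and `domain_size` and the example residue ranges are hypothetical and are not taken from DOMPLOT itself (which is written in C). The sketch assumes sequential PDB residue numbering and ignores insertion codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One contiguous stretch of the chain assigned to a domain (hypothetical type)."""
    domain: int  # domain number
    first: int   # PDB number of the first residue in the segment
    last: int    # PDB number of the last residue in the segment


def segment_size(seg: Segment) -> int:
    """Total residues spanned by the sub-sequence, whether or not
    co-ordinates exist for every residue (assumes sequential numbering)."""
    return seg.last - seg.first + 1


def domain_size(segments, domain: int) -> int:
    """Sum the sizes of all segments that belong to one domain."""
    return sum(segment_size(s) for s in segments if s.domain == domain)


# Illustrative chain: domain 1 is discontinuous (two segments), domain 2 is continuous.
segments = [Segment(1, 1, 40), Segment(2, 41, 120), Segment(1, 121, 150)]
```

With these illustrative ranges, the discontinuous domain 1 spans 40 + 30 residues and domain 2 spans 80.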

Interaction information and PROSITE motifs

Annotation by ligand contacts requires a list of intermolecular interactions, generated by the program GROW (Milburn et al., 1998). GROW processes the output of HBPLUS (McDonald and Thornton, 1994), an algorithm which identifies hydrogen bonds and other non-bonding interactions for a given PDB file. All possible positions for hydrogen atoms (H) are computed for donor atoms (D) which satisfy specified geometrical criteria with acceptor atoms (A) in the vicinity. The criteria used here are that the H–A distance is <2.7 Å, the D–A distance is <3.3 Å, the D–H–A angle is >90° and the H–A–AA angle is >90°, where the AA atom is the one attached to the acceptor. Non-bonded contacts are defined as those between atoms less than 3.9 Å apart.
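The geometric criteria above can be expressed as a short check. The following Python sketch is not HBPLUS code: the function names are hypothetical, atoms are assumed to be given as (x, y, z) tuples in Å, and the hydrogen position is taken as an input rather than computed. Only the thresholds come from the text.

```python
import math

# Thresholds quoted in the text
MAX_H_A = 2.7        # H-A distance, in angstroms
MAX_D_A = 3.3        # D-A distance, in angstroms
MIN_ANGLE = 90.0     # applies to both D-H-A and H-A-AA, in degrees
MAX_CONTACT = 3.9    # non-bonded contact distance, in angstroms


def angle(a, b, c):
    """Angle at vertex b (degrees) formed by points a-b-c."""
    v1 = tuple(p - q for p, q in zip(a, b))
    v2 = tuple(p - q for p, q in zip(c, b))
    dot = sum(p * q for p, q in zip(v1, v2))
    cos_theta = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to guard against rounding slightly outside [-1, 1]
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def is_hydrogen_bond(d, h, a, aa):
    """True if donor d, hydrogen h, acceptor a and acceptor antecedent aa
    satisfy all four geometric criteria."""
    return (
        math.dist(h, a) < MAX_H_A
        and math.dist(d, a) < MAX_D_A
        and angle(d, h, a) > MIN_ANGLE
        and angle(h, a, aa) > MIN_ANGLE
    )


def is_nonbonded_contact(p, q):
    """True if two atoms are closer than the non-bonded contact cut-off."""
    return math.dist(p, q) < MAX_CONTACT
```

For example, an idealized linear arrangement with D at the origin, H at 1.0 Å and A at 2.9 Å along the x axis passes all four tests, whereas moving A to 3.6 Å fails the D–A distance criterion.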

GROW classifies all interactions identified by HBPLUS according to the pairs PP, PL, PN, PM, NL, NM and LM (where P represents protein; L, ligand; N, nucleic acid and M, metal). Only the first four types of intermolecular interactions are relevant in DOMPLOT. If active-site information is available in the PDB file, it is also stored.
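The pair classification can be sketched as follows. This Python fragment is an illustration only: the function names and the tuple-based input are assumptions and do not reflect the actual GROW file format. The letter ordering is chosen so that the codes match those listed in the text.

```python
# Ordering chosen so that the two-letter codes match the text:
# PP, PL, PN, PM, NL, NM, LM
_RANK = {"P": 0, "N": 1, "L": 2, "M": 3}

# The first four pair types listed in the text, i.e. those involving the protein
RELEVANT_TO_DOMPLOT = {"PP", "PL", "PN", "PM"}


def pair_code(kind_a: str, kind_b: str) -> str:
    """Return the two-letter code for an interaction between two molecule
    kinds ('P', 'L', 'N' or 'M'), independent of argument order."""
    first, second = sorted((kind_a, kind_b), key=_RANK.__getitem__)
    return first + second


def select_relevant(interactions):
    """Keep only interactions whose pair code is one of the first four types.
    `interactions` is an iterable of (kind_a, kind_b, payload) tuples,
    a hypothetical in-memory form of the interaction list."""
    return [i for i in interactions if pair_code(i[0], i[1]) in RELEVANT_TO_DOMPLOT]
```

For instance, `pair_code("L", "P")` gives `"PL"`, which is retained, while `pair_code("M", "N")` gives `"NM"`, which is filtered out.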

PROSITE information is obtained from a file listing all PROSITE motifs for each protein structure in the PDB (Kasuya and Thornton, 1999). Common motifs such as those corresponding to glycosylation sites are usually not included.

Drawing the plot

Figure 1A shows a MOLSCRIPT (Kraulis, 1991) diagram of the ATPase fragment of a heat-shock cognate 70 kDa protein (PDB code 1atr) (O'Brien and McKay, 1993). A simplified DOMPLOT picture, highlighting the domain organization using the same colour-coding, is shown in Figure 1B, and the standard DOMPLOT diagram of the protein chain is shown in Figure 1C.





 
Fig. 1. (A) MOLSCRIPT diagram of the ATPase fragment of a heat-shock cognate 70 kDa protein (PDB code, 1atr) with the ligands magnesium adenosine diphosphate and phosphate bound. Each of the four domains is identified by a different colour. (B) A simplified DOMPLOT diagram of the single chain of 1atr. Boxes represent segments of the polypeptide chain which are assigned to domains, and their length is proportional to the number of residues in the sub-sequence. The domain boxes are coloured according to the scheme used in Figure 1A. Horizontal lines represent residues which are not assigned to a domain, although two sequential segment boxes are separated by a short linker to improve the clarity of the output. The numbers above the ends of each box correspond to the PDB numbers of the first and last residue in each domain segment. (C) The standard DOMPLOT diagram of the single chain of 1atr. Coloured vertical lines within the boxes highlight residues involved in ligand interactions (with black lines representing active-site residues), and coloured horizontal lines below the boxes denote PROSITE motifs. Domain assignments, CATH numbers and ligand and PROSITE information are given in plain text below the box plot. Ligands and PROSITE names are coloured according to the colouring scheme used in the box plot.

 
Boxes represent sub-sequences or ‘segments’ that are assigned to domains. The length of the box is dependent upon the number of residues within the domain segment. The number, and insertion code if applicable, as given in the PDB file, of the first and last residue in each segment are placed above each end of the box. Single horizontal lines or ‘links’ denote ‘fragments’ comprising residues which are not assigned to domains. Also, two sequential segments that are assigned to different domains are separated by a short link to improve the clarity of the output.

Gaps of residues for which there is no structural information are represented by dotted lines, the lengths of which correspond to the size of the gap. (If the gap is at the N- or C-terminus, which can be very common, only a short dotted line is drawn regardless of the gap's length.) Gaps situated between domain segments are represented by dotted fragment links, whereas those lying within a domain segment are represented by two dotted lines so as to maintain the continuity of the domain segment box (not shown in this example).
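The box-and-link layout described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not DOMPLOT's implementation: the scale, link width and function names are hypothetical, dotted lines for missing residues are omitted, and segments are given as (domain, first, last) tuples in chain order. The emitted text uses only standard PostScript path operators.

```python
SCALE = 1.0        # points per residue (assumed value)
SHORT_LINK = 6.0   # fixed width of the short link between adjacent segments (assumed)
BOX_HEIGHT = 12.0  # box height in points (assumed)


def layout(segments, scale=SCALE, short_link=SHORT_LINK):
    """Return drawing items ('box' | 'link', x, width, domain) for a chain.
    Box width is proportional to the number of residues in the segment;
    unassigned residues between segments become a proportional link, and
    directly adjacent segments are separated by a short fixed link."""
    items, x, prev_last = [], 0.0, None
    for domain, first, last in segments:
        if prev_last is not None:
            unassigned = first - prev_last - 1
            width = unassigned * scale if unassigned > 0 else short_link
            items.append(("link", x, width, None))
            x += width
        width = (last - first + 1) * scale
        items.append(("box", x, width, domain))
        x += width
        prev_last = last
    return items


def to_postscript(items, y=100.0, height=BOX_HEIGHT):
    """Emit a minimal PostScript page with outlined boxes and link lines."""
    out = ["%!PS-Adobe-3.0"]
    for kind, x, w, _domain in items:
        if kind == "box":
            out.append(
                f"newpath {x:.1f} {y:.1f} moveto {w:.1f} 0 rlineto "
                f"0 {height:.1f} rlineto {-w:.1f} 0 rlineto closepath stroke"
            )
        else:
            mid = y + height / 2  # links run through the vertical centre of the boxes
            out.append(f"newpath {x:.1f} {mid:.1f} moveto {w:.1f} 0 rlineto stroke")
    out.append("showpage")
    return "\n".join(out)


items = layout([(1, 1, 40), (2, 41, 120), (1, 131, 150)])
postscript = to_postscript(items)
```

In this illustrative chain, the first two segments are adjacent and receive a short link, whereas the ten unassigned residues before the third segment produce a link of proportional length.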

Residues involved in intermolecular interactions as given in the GROW output file are represented by a vertical line at the appropriate position along the segment box. The line is coloured to indicate the identity of the ligand with which it interacts. Black vertical lines denote active-site residues. Lines with two or more colours indicate residues bound to two or more ligands. Labels which give the residue number (and insertion code) of the interacting residues are placed as close as possible to the corresponding interaction lines so as to avoid overlap. A horizontal line placed immediately below all or part of a segment box and/or fragment line denotes a PROSITE motif. These lines are coloured to distinguish between two or more motifs with different PROSITE accession codes.
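A minimal Python sketch of how interaction lines might be positioned and coloured is given below. The function names, the palette and the (residue number, ligand name) input form are assumptions made for illustration; they do not describe the GROW output format or DOMPLOT's actual colour scheme.

```python
from collections import defaultdict


def ligands_by_residue(contacts):
    """Group (residue_number, ligand_name) contacts by residue.
    A residue with two or more ligands would be drawn as a multi-coloured line."""
    grouped = defaultdict(set)
    for resnum, ligand in contacts:
        grouped[resnum].add(ligand)
    return {resnum: sorted(ligs) for resnum, ligs in sorted(grouped.items())}


def line_x(resnum, seg_first, box_x, scale=1.0):
    """Horizontal position of the vertical line for a residue, placed at the
    centre of that residue's slot within its segment box."""
    return box_x + (resnum - seg_first + 0.5) * scale


def assign_colours(ligands, palette=("red", "blue", "green", "orange")):
    """Give each distinct ligand a colour from a hypothetical palette."""
    return {lig: palette[i % len(palette)] for i, lig in enumerate(sorted(set(ligands)))}


# Illustrative contacts: residue 12 touches both ADP and a magnesium ion
contacts = [(12, "ADP"), (12, "MG"), (15, "ADP"), (12, "ADP")]
by_residue = ligands_by_residue(contacts)
colours = assign_colours(lig for _res, lig in contacts)
```

Here residue 12 maps to two ligands and would therefore be drawn as a two-colour line, while residue 15 receives a single colour.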

The boundaries of the segments within each domain are given in plain text below the graphics, as well as those ions, ligands and/or nucleic acid chains with which the domain interacts. The CATH number of the homologous superfamily to which each domain is assigned may also be given, as well as the name, code and sequence of each PROSITE motif. The bound molecules and PROSITE names are coloured according to the colouring scheme used in the box plot.


    Applications
 
Flavin-binding TIM barrel domains

The TIM barrel owes its name to chicken triosephosphate isomerase in which the α/β barrel was first observed. The TIM barrel structural domain comprises eight parallel β-strands surrounded by seven or eight α-helices, and is found in a wide variety of enzymes. Whilst the evolutionary path of many fold families is well defined, that of the α/β barrel is less clear. The low sequence similarity and diverse functional activity amongst the barrel enzymes support a mechanism by which they have converged to a stable protein fold. However, the active site is invariably found at the C-terminus of the barrel domain, suggesting also that α/β barrel proteins may have diverged from a common ancestor.

Figure 2A illustrates the structural domain organization of four protein chains which contain a flavin-binding TIM barrel domain. The PDB codes 1oya (Fox and Karplus, 1994), 2tmd (Lim et al., 1986; Mathews et al., 1993, unpublished data), 1gox (Lindqvist, 1989) and 1fcb (Xia and Mathews, 1990) correspond to old yellow enzyme (OYE), trimethylamine dehydrogenase (TMDH), glycolate oxidase (GO) and flavocytochrome b2 (FCB2) respectively. Figure 2B gives a DOMPLOT picture for the output of a CORA multiple structural alignment (Orengo, 1998) of the TIM barrel domains in each protein. CORA compares the structural environments of the residues and employs a double dynamic programming algorithm. From this concise DOMPLOT summary of the detailed structural comparison (Figure 2B) it is apparent that there are two structurally similar pairs of domains: 1oya and the first domain of chain A of 2tmd, and 1gox and the second domain of chain B of 1fcb. The high conformational similarities of the TIM barrel domains of GO and FCB2 (Lindqvist et al., 1991), and of OYE and TMDH (Fox and Karplus, 1994) have previously been established. Pairwise sequence identities, SSAP scores (Taylor and Orengo, 1989) and r.m.s. deviations of these four domains are given in Table I.




 
Fig. 2. (A) DOMPLOT diagrams illustrating the structural domain organization of 1oya, 2tmd (chain A), 1gox and 1fcb (chain B) (see text for references). The TIM barrel domains in each protein chain have been aligned under one another and are coloured red. (The PostScript file was edited to illustrate the first domain of 1fcbB, which was not seen in the electron density map and is therefore not drawn automatically as a box by the DOMPLOT program.) The other domains of 2tmdA and 1fcbB have different colours to indicate that they are not homologous, according to the CATH structural classification scheme. The enzyme activities of the four proteins are as follows (see text for abbreviations): (a) OYE, E.C. number 1.6.99.1, reduces the olefinic bond of many α,β-unsaturated carbonyl compounds (Karplus et al., 1995), NADPH + acceptor = NADP+ + reduced acceptor; (b) TMDH, E.C. 1.5.99.7, trimethylamine + H2O = dimethylamine + formaldehyde; (c) GO, E.C. 1.1.3.15, glycolate + O2 = glyoxylate + H2O2; (d) FCB2, E.C. 1.1.2.3, (S)-lactate + 2 ferricytochrome c = pyruvate + 2 ferrocytochrome c. (B) DOMPLOT diagram for the output of a CORA multiple structural alignment of four TIM barrel domains. The secondary structure consensus is given at the top of the plot, and E, H, G and T correspond to β-strand, α-helix, 3₁₀-helix and turn respectively (Kabsch and Sander, 1983). The diagram is analogous to a multiple sequence alignment but derived from the structural information, with a gap implying no structural equivalence at that position. The gaps in 1oya closely correspond to those in 2tmdA, and likewise for the pair 1gox and 1fcbB, indicating two structurally similar pairs of domains. The colouring scheme of the residue interaction consensus is identical to that used in colouring the residue interaction lines, and hence all of the conserved ligand contacts correspond to binding of FMN.
The CATH number of the domains and the name of the PROSITE motif (in 1gox and 1fcb) are also given. Residues represented in the plot are as follows: 1–399 (1oya), 1–384 (2tmd, chain A), 1–188 and 198–358 (1gox), 100–299 and 312–510 (1fcb, chain B).

 

 
Table I. Pairwise sequence identities, SSAP scores (in bold) and r.m.s.d. (in italics) of the four TIM barrel domains
 
All four TIM barrel domains bind flavin mononucleotide (FMN) as a cofactor, although TMDH also binds an iron–sulphur cluster, Fe4S4. The substrates differ widely amongst the four proteins, with the exception of GO and FCB2, whose substrates differ only by a methyl group. The residue interaction consensus in Figure 2B indicates that eight structurally equivalent residues interact with FMN in all four TIM barrels, although many more residues involved in flavin binding in each of the four proteins are in almost equivalent positions in the alignment. The detailed structural comparison of GO, FCB2 and TMDH by Lindqvist et al. (1991) identified the similarity in FMN binding in these three enzymes, an observation which is highlighted in the DOMPLOT diagram.
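One way to derive such a residue interaction consensus from a structural alignment is sketched below in Python. The data layout and function name are assumptions for illustration: the alignment is represented as a list of columns, each holding one residue number per structure (or None for a gap), and the toy values are invented rather than taken from the CORA alignment of the four barrels.

```python
def interaction_consensus(alignment, contacts):
    """Return the indices of alignment columns in which every structure has
    an aligned residue (no gap) and that residue contacts the ligand.

    alignment: list of columns; each column is a tuple with one residue
               number per structure, or None where the structure has a gap.
    contacts:  list of sets; contacts[i] holds the residue numbers of
               structure i that interact with the ligand of interest.
    """
    conserved = []
    for index, column in enumerate(alignment):
        if all(res is not None and res in contacts[i] for i, res in enumerate(column)):
            conserved.append(index)
    return conserved


# Toy alignment of four structures over three columns (invented numbers)
alignment = [
    (10, 20, None, 5),  # gap in the third structure
    (11, 21, 30, 6),    # all four residues contact the ligand
    (12, 22, 31, 7),    # fourth structure's residue does not contact it
]
contacts = [{11, 12}, {21, 22}, {30, 31}, {6}]
conserved_columns = interaction_consensus(alignment, contacts)
```

Only the second column satisfies the criterion in this toy example. Residues that bind the ligand in nearly, but not exactly, equivalent positions are deliberately not counted by this strict definition.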

The question of whether conservation of the position of the active site in the TIM barrels is evidence for their divergence from a common ancestor has arisen time and again (Farber and Petsko, 1990; Branden, 1991). Farber and Petsko clustered GO, FCB2 and TMDH into the same homologous sub-family on the basis of structural and chemical properties, and given the high structural similarity of OYE and TMDH, old yellow enzyme would also be classified in this group. In contrast, Scrutton (1994) grouped the pairs OYE and TMDH, and GO and FCB2, into two separate families, given the sequence identities and modular assembly of these and other flavin oxidase/dehydrogenase enzymes. He reached no conclusion on the evolutionary origin of these two sub-classes. Wilmanns et al. (1991) detected sequence similarity in the phosphate binding site regions of nine TIM barrels, including GO, FCB2 and TMDH, and proposed that this observation is indicative of divergent evolution.

The example illustrates how DOMPLOT can be used to provide a clear and concise summary of detailed structural comparisons; with the annotation of ligand contacts, structural conservation of binding sites is immediately evident. The FMN-binding site is conformationally highly similar in all four barrels, yet the sequence identity between the two homologous pairs is low, which suggests that the pairs may have diverged from an early ancestor. When used in conjunction with other tools, the program is a valuable aid to the detection of remote protein homology.


    Availability
 
DOMPLOT is written in C, and a DOMPLOT server is available at http://www.biochem.ucl.ac.uk/bsm/domplot/. The user may enter one or more PDB codes (with chain identifiers, if applicable) and select the output type (simple or annotated); a diagram is then generated automatically. Alternatively, the user may upload a set of co-ordinates in PDB format, provide domain boundaries and alter the default HBPLUS interaction parameters. The diagrams may be downloaded in PostScript format.


    Acknowledgments
 
A.E.T. is supported by a BBSRC special studentship and is sponsored by Oxford Molecular. C.A.O. is supported by the Medical Research Council. We acknowledge support from the BBSRC for computing facilities.


    Notes
 
3 To whom correspondence should be addressed


    References
 
Adobe Systems Inc. (1985) Postscript Language Reference Manual. Addison Wesley Press, Reading, MA.

Bairoch,A., Bucher,P. and Hofmann,K. (1997) Nucleic Acids Res., 25, 217–221.

Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F.,Jr., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.

Branden,C.-I. (1991) Curr. Opin. Struct. Biol., 1, 978–983.

Farber,G.K. and Petsko,G.A. (1990) Trends Biochem. Sci., 15, 228–234.

Fox,K. and Karplus,P.A. (1994) Structure, 2, 1089–1105.

Gouzy,J., Eugene,P., Greene,E.A., Kahn,D. and Corpet,F. (1997) Comput. Appl. Biosci., 13, 601–608.

Holm,L. and Sander,C. (1994) Proteins, 19, 256–268.

Islam,S.A., Luo,J. and Sternberg,M.J.E. (1995) Protein Engng, 8, 513–525.

Jones,S., Stewart,M., Michie,A., Swindells,M.B., Orengo,C. and Thornton,J.M. (1998) Protein Sci., 7, 233–242.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Karplus,P.A., Fox,K.M. and Massey,V. (1995) FASEB J., 9, 1518–1526.

Kasuya,A. and Thornton,J.M. (1999) J. Mol. Biol., 286, 1673–1691.

Kraulis,P.J. (1991) J. Appl. Crystallogr., 24, 946–950.

Lim,L.W., Shamala,N., Mathews,F.S., Steenkamp,D.J., Hamlin,R. and Xuong,N. (1986) J. Biol. Chem., 261, 15140–15146.

Lindqvist,Y. (1989) J. Mol. Biol., 209, 151–166.

Lindqvist,Y., Branden,C.-I., Mathews,F.S. and Lederer,F. (1991) J. Biol. Chem., 266, 3198–3207.

McDonald,I.K. and Thornton,J.M. (1994) J. Mol. Biol., 238, 777–793.

Milburn,D., Laskowski,R. and Thornton,J.M. (1998) Protein Engng, 11, 855–859.

O'Brien,M.C. and McKay,D.B. (1993) J. Biol. Chem., 268, 19656–19658.

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.

Orengo,C.A. (1998) Protein Sci., in press.

Richardson,J.S. (1981) Adv. Protein Chem., 34, 246–253.

Schultz,J., Milpetz,F., Bork,P. and Ponting,C.P. (1998) Proc. Natl Acad. Sci. USA, 95, 5857–5864.

Scrutton,N.S. (1994) BioEssays, 16, 115–122.

Siddiqui,A.S. and Barton,G.J. (1995) Protein Sci., 4, 872–884.

Swindells,M.B. (1995) Protein Sci., 4, 103–112.

Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 1–22.

Wilmanns,M., Hyde,C.C., Davies,D.R., Kirschner,K. and Jansonius,J.N. (1991) Biochemistry, 30, 9161–9169.

Xia,Z.-X. and Mathews,F.S. (1990) J. Mol. Biol., 212, 837–863.

Received September 22, 1998; revised January 1, 1999; accepted January 22, 1999.