Systèmes Moléculaires et Biologie Structurale, LMCP, CNRS UMR C7590, Universités Pierre et Marie Curie (P6) et Denis Diderot (P7), Tour 16, Case 115, 4 Place Jussieu, 75252 Paris cedex 05, France. E-mail: poupon{at}lmcp.jussieu.fr
Keywords: comparative study/hydrophobic core/immunoglobulin fold/multiple sequence alignment/protein folding
Introduction
Classical Ig-like domains are composed of 7–10 β-strands, distributed between two sheets with typical topology and connectivity. However, recent structural analyses revealed additional secondary structure elements in this classical scaffold, such as additional strands [SOD, DPA (PapD), RSY] or helices (CTM, HCY) (see Table I for nomenclature). In this paper, we report an all-against-all structural comparison of 52 distinct Ig-like domains (1326 pairs), having less than 55% pairwise sequence identity. The structures considered were selected from the PDB (Table I). Structure-based sequence analysis and comparison of structural features were performed in order to characterize sequence–structure compatibility. Our observations led us to propose a new structural classification within the immunoglobulin fold family (IgFF).
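The figure of 1326 pairs follows directly from comparing each of the 52 domains with every other domain exactly once, i.e. 52 × 51 / 2. The short sketch below checks this arithmetic; the domain labels are placeholders invented for the illustration and are not the identifiers of Table I.

```python
from itertools import combinations


def count_pairs(n_domains: int) -> int:
    """Number of unordered pairs in an all-against-all comparison."""
    return n_domains * (n_domains - 1) // 2


# Placeholder labels (hypothetical); the real study uses the entries of Table I.
domains = [f"domain_{k}" for k in range(1, 53)]

# Enumerate every unordered pair once, as an all-against-all comparison would.
pairs = list(combinations(domains, 2))

print(count_pairs(len(domains)), len(pairs))
```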
Methods
Superimpositions within each group were generated by various automatic programs [COMPOSER (Sutcliffe et al., 1987), DALI (Holm and Sanders, 1993) and COMPARER (Sali and Blundell, 1990)] and checked manually. Protein pairs belonging to different groups were superimposed manually using a pseudo-iterative method comparable to that of Hubbard and Blundell (1987). For this type of comparison the use of automatic programs was impossible because of the differences in the orientation of the two sheets and the presence in some structures of extra strands.
Results and discussion
3D superimposition and sequence analysis
Structures were compared by finding the optimal superimposition between the pair considered while avoiding a unique reference structure with which all domains would be compared. The definition of a `mean structure' for the IgFF does not make sense because of the great diversity in structure and function. Such problems in structure superimposition have been widely studied (Lesk et al., 1986; Godzik et al., 1993; Johnson et al., 1993; Maiorov and Crippen, 1994) and cannot be solved by the existing automatic methods. Variations in the lengths of regular secondary structures (strands) and variable loop regions make the alignment of the whole domains impossible (Figure 1). A common core was defined among the 35 structurally equivalent residues, localized in the six strands common to the 52 structures (strands A, B, C, E, F and G). The fourth strand, numbered according to its appearance in the sequence, cannot be aligned for all the domains studied owing to its variable localization in sheet I (strand D) or in sheet II (strand C'). The variable domains contain both strands C' and D (Figure 2) and sometimes a ninth strand C''.
In order to investigate the similarity between the different groups defined in the IgFF (Figure 2), the topohydrophobic positions in each of them were determined (Table II). A score Si,j, which depends on the number of topohydrophobic positions in each group (ni for group i, nj for group j) and on the number of topohydrophobic positions common to the two groups (Ni,j), was defined [equation not recovered from the source] and computed for all possible pairs of groups (Table III). In each group the number of topohydrophobic positions is close to what is expected for a homogeneous family (7–10% of the total amino acids), except for the groups `Others' (which was already known to be heterogeneous) and V.
Hydrogen bonds in the Ig-like domains are mainly conserved in the structural common core (Figure 5) (Kabsch and Sander, 1983). However, some domains deviate from this general scheme, in that not all possible H-bonds are formed or, alternatively, some are established between non-equivalent residues. In many domains, the external strands A and G present geometrical distortions known as β-bulges (Richardson, 1977; Chan et al., 1993), which lead to an imperfect general H-bond network. Hydrogen bonds have been extensively shown to be important for the stability and dynamics of protein domains (Pfuhl et al., 1997; Vogt and Argos, 1997) and probably play an important role in Ig domains.
Based on the structural alignment, a typical hydrophobic core of the Ig fold can be described. Solvent accessibility surfaces (SAS) were calculated for each residue (Figure 6A). These calculations show that the structural core can be virtually divided into three parts: buried positions (SAS < 20 Å²), internal positions (20 Å² < SAS < 50 Å²) and exposed positions (SAS > 50 Å²). The most buried positions are B3, C3 and F3 (SAS < 10 Å²). For strands B, C, E and F, amino acids with side chains pointing towards the interior of the protein have SAS < 20 Å² and amino acids with side chains pointing towards the exterior of the protein have SAS between 20 and 50 Å². For the other strands (A, C', D and G), the SAS of residues oriented towards the interior (between 20 and 50 Å²) is still lower than that of residues oriented towards the exterior (>50 Å²), but the values are higher.
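The three accessibility classes described above can be written as a simple threshold rule. The sketch below is only an illustration of that rule: the function name and the example values are hypothetical, and the handling of the exact boundary values (20 and 50 Å², which the text leaves unassigned) is an assumption made here.

```python
def classify_sas(sas: float) -> str:
    """Assign a position to a class from its solvent accessible surface (in Å^2).

    Thresholds follow the text: buried < 20, internal 20-50, exposed > 50.
    Assumption: the boundary values 20 and 50 are counted as 'internal'.
    """
    if sas < 0:
        raise ValueError("SAS cannot be negative")
    if sas < 20.0:
        return "buried"
    if sas <= 50.0:
        return "internal"
    return "exposed"


# Hypothetical SAS values, chosen only to exercise each class.
example = {"pos_1": 8.0, "pos_2": 35.0, "pos_3": 72.5}
classes = {name: classify_sas(value) for name, value in example.items()}
print(classes)
```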
The hydrophobic core described here is a key signature with an impact on the structural behavior of the Ig fold.
The disulfide bonds
A dominant feature of canonical Ig domains is the disulfide bridge connecting strands B and F. However, in several members of the family, disulfide bonds, when they exist, involve two strands, a strand and a loop or two loops. The disulfide bridge of Ig molecules is invariably located between positions B3 and F3. Only ~0.5% of the sequences of the KABAT bank (Martin, 1996) lack these cysteine residues, with consequent loss of their biological activities (Proba et al., 1997) in the same manner as Ig mutants with substitutions at the cysteine residues. However, a functional antibody lacking the disulfide bridge has been observed (Rudikoff and Pumphrey, 1986).
The usual method for identifying Ig-like domains consists of checking the length of the fragment between the two cysteine residues. Indeed, the number of residues within the disulfide bridge varies significantly as the domain type changes: 63–76 (V), 54–64 (C1), 38–43 (C2) and 44–51 (I) (Smith and Xue, 1997).
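As an illustration of this length criterion, the sketch below maps an inter-cysteine fragment length to the domain types whose quoted range contains it. The function name is hypothetical and the exact counting convention for the fragment length is not stated in the text, so it is treated here as an input. Because the V and C1 ranges overlap at 63–64 residues, the function returns a list of candidates rather than a single type.

```python
# Residue counts between the two bridged cysteines, as quoted in the text
# (Smith and Xue, 1997). Bounds are treated as inclusive (assumption).
SPACING_RANGES = {
    "V": (63, 76),
    "C1": (54, 64),
    "C2": (38, 43),
    "I": (44, 51),
}


def candidate_types(n_residues: int) -> list[str]:
    """Return every domain type whose spacing range contains n_residues."""
    return [
        name
        for name, (low, high) in SPACING_RANGES.items()
        if low <= n_residues <= high
    ]


print(candidate_types(70))  # a single candidate
print(candidate_types(64))  # overlap between V and C1
print(candidate_types(52))  # no quoted range matches
```

An empty result does not exclude an Ig-like fold: as discussed in this section, many Ig-like domains lack the bridge altogether.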
However, the disulfide bridge is not a common feature of the IgFF. Although it is expected to be important for the correct and stable folding of an Ig domain, many Ig-like domains lack the B3–F3 disulfide bond, raising the question of its actual involvement in the folding pathway. Furthermore, some proteins, such as TLK, NFK1, NFK2, BGL, CTM and GGT3, exhibit free cysteine residues. The impact of the classical disulfide bridge occurring within Ig-like domains can be evaluated as follows. (a) The absence of the disulfide bridge or the occurrence of a bridge in an atypical position explains the larger separation of the two sheets in comparison with classical domains, as reflected by the distance between the Cα atoms of positions B3 and F3 (6.1–7.1 and 7.2–11.7 Å for domains with and without the canonical disulfide bridge, respectively) (Table II). The role of this disulfide bridge in the compactness of the domain has been confirmed by mutagenesis of an Ig-variable domain, which also demonstrated the ability of the protein to accommodate the absence of the cysteine residues while maintaining its fold (Proba et al., 1997). (b) In domains lacking the disulfide bridge, the cysteine residue is replaced by a strong hydrophobic residue, the side chains of which maintain the hydrophobic core formation.
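The B3–F3 separation used in point (a) is an ordinary Euclidean distance between two Cα positions. The sketch below computes it and reports which of the two quoted ranges the value falls in. The function names and coordinates are hypothetical, and the range check only restates the observed values; it is not meant as a predictor of bridge presence.

```python
import math

# Cα(B3)-Cα(F3) distance ranges in Å quoted in the text (Table II).
WITH_BRIDGE = (6.1, 7.1)
WITHOUT_BRIDGE = (7.2, 11.7)


def ca_distance(a: tuple[float, float, float],
                b: tuple[float, float, float]) -> float:
    """Euclidean distance between two Cα coordinates (Å)."""
    return math.dist(a, b)


def matching_range(distance: float) -> str:
    """Name the quoted range, if any, that contains the distance."""
    if WITH_BRIDGE[0] <= distance <= WITH_BRIDGE[1]:
        return "range observed with canonical bridge"
    if WITHOUT_BRIDGE[0] <= distance <= WITHOUT_BRIDGE[1]:
        return "range observed without canonical bridge"
    return "outside reported ranges"


# Hypothetical coordinates, for illustration only.
d = ca_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
print(d, matching_range(d))
```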
In conclusion, the disulfide bond may have more of a functional than a structural role. The absence of this structural constraint in many domains may allow adaptation to specific biological functions or to particular structural features, such as the insertion of additional secondary structure in the domain, and may enhance the assembly of many Ig-like domains such as those of the fibronectin type.
Structural classification of the IgFF
Williams and Barclay (1988) divided classical Ig domains into three topological domain subtypes: C1 (constant 1), C2 (constant 2) and V (variable). The resolution of many structures of Ig-like domains has revealed new topological subtypes, including subtype I (intermediate) (Harpaz and Chothia, 1994) and the S (switched) and H (hybrid) types (Bork et al., 1994). The present analysis is in agreement with these studies and extends the comparison to Ig-like domains possessing additional strands, such as the structures of SOD (superoxide dismutase), hemocyanin, DPA (PapD) domain 2 and cytochrome f. The only criterion required is the occurrence in the domain of a topology and connectivity similar to those of immunoglobulins (Halaby and Mornon, 1998). Domains that are distant in terms of angles between sheets, twists in some strands or difficult superimposition are also included in our study. The extension of the previous structural classifications to the newly identified structures, combined with a sequence analysis of the Ig-like domains, led us to define two new subtypes: C3 (constant 3) and C4 (constant 4) (Table II). The discrimination of these two groups is justified by differences in sequence: Fn3 and C4 have different hydrophobic cores, the only common feature being a tyrosine in position C1; Fn3 proteins have a tryptophan residue in position B5, an aromatic residue in position C5 and a tyrosine residue in position F1, none of which is found in proteins of the C4 sub-family. It is also justified by structural characteristics: proteins of the C4 sub-family have two conserved disulfide bonds, neither of which is found in proteins of the Fn3 sub-family, and two β-strands forming a small sheet perpendicular to the two canonical ones, implicated in the active site (Table II). This discrimination can also be justified by the values obtained for the Si,j parameters: the highest value is obtained when comparing the sub-families I and C1, whose separation is largely accepted, so it seems reasonable to split Fn3–C4 and Fn3–C3, as in both cases the value obtained is much lower.
The information contained in the structural distance matrix (r.m.s.d. values) is illustrated through hierarchical clustering [using the program MOLPHY (Saitou and Nei, 1987)] as shown in Figure 7. The distance between the 52 proteins studied, measured by the r.m.s.d. values, is coherent with the classification in subgroups. However, the tree reported in this study was established on the basis of structural similarity and should not be directly compared with trees constructed on the basis of sequence comparison. Cross-comparison of the 52 Ig-like domains reveals a coherent clustering into subclasses, which together with the sequence analysis results in a new classification of Ig-like domains.
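For readers unfamiliar with the r.m.s.d. values that populate the structural distance matrix, the sketch below gives the standard definition for two sets of equivalent Cα atoms. It assumes that the two structures have already been superimposed and that the atoms are listed in matching order; it does not perform the superimposition itself, and it is not the procedure implemented by the programs cited in the Methods. The function name and coordinates are hypothetical.

```python
import math

Coord = tuple[float, float, float]


def rmsd(coords_a: list[Coord], coords_b: list[Coord]) -> float:
    """Root mean square deviation (Å) between paired, pre-superimposed atoms."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must list the same equivalent atoms")
    if not coords_a:
        raise ValueError("at least one atom pair is required")
    # Sum of squared distances between equivalent atoms, then mean and root.
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical coordinates: every atom of the second set is shifted by 1 Å in x.
set_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
set_b = [(1.0, 0.0, 0.0), (4.8, 0.0, 0.0), (8.6, 0.0, 0.0)]
print(rmsd(set_a, set_b))
```

Filling a symmetric 52 × 52 matrix with such values, one per pair, gives the kind of input from which a tree like that of Figure 7 can be built.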
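The distance between two subgroups G1 and G2 was defined by an equation [not recovered from the source] in which n1 and n2 are the numbers of members in subgroups G1 and G2, respectively. It relies on a pairwise term [equation likewise not recovered] in which r.m.s.(P1,P2) and id(P1,P2) are the root mean square deviation and the sequence identity, respectively, between the two proteins P1 and P2 belonging to groups G1 and G2.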
Different hypotheses have been made, mainly to explain how the primordial domain might have gained or lost a strand, leading to intermediate structures. Depending on the authors, the original domain might be the V domain (Williams and Barclay, 1988) or the C2 domain (Hunkapiller and Hood, 1989; Smith and Xue, 1997).
Since several Ig-like domains did not cluster with any of the structural sets described above (NCD, CTM, NFK2), additional subclasses of the Ig fold must exist and should become better documented as new 3D structures are solved. The NCD differs from a V domain by the localization of strand A between the two sheets and the absence of hydrogen bonds between strands A and B. The CTM domain presents nine strands, like a variable domain, but the connectivity between the C and E strands is atypical: the topology of the CTM domain is AA'BCDC'EFG (instead of AA'BCC'DEFG for a variable domain). The second domain of NFK could be described as intermediate between variable domains (same number of strands) and constant domains (a maximum of 14% sequence identity with C4 domains and with bacterial chitinase within the whole superfamily).
Conclusion
In a previous paper, we showed that the immunoglobulin fold family (IgFF) comprises a heterogeneous group of proteins sharing structural similarity but exhibiting a wide range of functions, species and tissue distribution. In this paper, 52 Ig-like domains found in the PDB were compared in order to define and characterize sequence and structural constraints of the Ig fold. The structure-based multiple alignment of the sequences revealed low overall sequence identity (often in the 5–15% range) and no functional relationship. Geometrical features, such as secondary structure, hydrogen bonds, disulfide bridges and solvent exposure, were compared through 1326 pairs of Ig-like domains.
Within the compared Ig-like domains, a few residues form the common core. As a general rule, two sequences which share at least 30% sequence identity are considered to fold very similarly (Chothia and Lesk, 1986; Schneider and Sander, 1991). The IgFF is remarkable in that most of the Ig-like domains display <10% sequence identity. Many studies have shown that the folding pattern of a protein is dependent not only on its sequence, but also implicitly on its overall amino acid composition (Nakashima et al., 1986; Chou, 1989) and that the size of the protein and the percentage of each amino acid can be used to predict the folding type. In the IgFF domains, most of the residues constituting the common core are, as expected, hydrophobic and are concentrated in a small number of conserved positions, probably responsible for maintenance of the Ig fold. Membership in this continually growing structural family requires specific interactions that stabilize the folded domains: (a) the formation of a typical hydrophobic core coded by the sequence; (b) the occurrence of specific tertiary interactions within the hydrophobic core; (c) in several subtypes, the introduction of disulfide bridges which influence the overall domain shape and also the symmetry between the two sheets.
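The sequence identity percentages quoted in this section are computed over aligned positions. The sketch below shows one common convention, in which positions containing a gap in either sequence are ignored. This convention, the function name and the toy sequences are assumptions made for the illustration; the text does not state which denominator was used in the study.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over gap-free aligned positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        # Skip columns where either sequence has a gap (assumed convention).
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a == res_b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy aligned fragments (hypothetical, not taken from the alignment of Figure 1).
print(percent_identity("ACDE", "ACDF"))
print(percent_identity("A-CD", "AGCD"))
```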
Although these proteins retain a common fold, structural changes occur as their sequences diverge. Residue substitutions do not change the overall appearance of the β-strands. However, changes in H-bond spacing, in the twist of strands or in the orientation of one sheet relative to the other are observed to accommodate the sequence variation. Here we emphasize for a large sample that Ig-like domains have more structural (r.m.s.d. between Cα atoms always <3.9 Å) than sequence similarities (identity mainly <25%). The hydrophobic core probably has a major impact on the uniqueness and stability of the Ig fold. As a general rule, mutations are not disruptive, as we observe a conservation of the properties of amino acids (hydrophobic/hydrophilic) along the alignment. A 29-residue structural core is common to all of the 52 considered domains, defined by the strands B, C, E and F and by six additional residues belonging to strands C' or D. The external strands A and G are more difficult to align owing to irregularities and distortions in several domains. The β-bulges occurring in strand A in some domains lead to the appearance of an additional strand A', such as in Ig-variable domains and many domains distantly related to Ig molecules.
Despite the wide sequence variations in Ig-like domains, the maintenance of the Ig fold seems to be enhanced by a conserved geometry of hydrogen bonds. In addition to sequence analysis of the Ig-like domains, the quantitative evaluation of their structural similarity appears to be important to build models for other members of the IgFF, to elucidate Ig folding principles and to predict new members through sensitive sequence comparisons (e.g. Mornon et al., 1997).
Ig-like domains have been identified in various kingdoms including eukaryotes and prokaryotes, bacteria, viruses, fungi and plants [see Halaby and Mornon (1998) for a review]. Some of these domains lack known biological activities, such as those present in bacterial enzymes. The widespread occurrence of the Ig fold and its appearance in plants (Martinez et al., 1994) precludes any species or function exclusivity, i.e. the immune response, and raises the question of the origins of the fold. Is the Ig fold derived from a common ancestor, where in some cases the functional activities have been lost during evolution, or is it a stable structure to which many sequences have converged?
Members of the immunoglobulin family are known to be phylogenetically related and Gelfand and Kister (1995) showed that there are 47 similar positions in the Ig sequences of the Kabat bank, eight being strictly conserved. Such identity cannot be extended to the IgFF, as illustrated by the fact that no strict topohydrophobic positions can be identified for the whole family. Indeed, the study of topohydrophobic positions in the previously defined groups clearly demonstrated the homogeneity within the groups and the heterogeneity between them. Interestingly, the scores computed for each pair of groups in the IgFF and the phylogenetic tree calculated on the basis of sequence identity and r.m.s.d. values correlate well: the pairs of groups which are close to each other in the phylogenetic tree have high scores and those which are distant in the tree correspond to low scores. This result confirms that topohydrophobic positions are indeed related to structural and sequence features.
The determination of topohydrophobic positions being a very recent technique, it is difficult to quantify it accurately. However, the values for Si,j obtained in the present study fit nearly exactly with structural data: values obtained for two subsets of structures belonging to the same sub-family are always higher than 1.5 (data not shown) and consequently always higher than the values obtained by comparing two different sub-families.
The present study cannot definitely answer the difficult question of whether the IgFF evolved by divergent or convergent processes or both mechanisms. Indeed, structural and sequence conservation are high between subfamilies that are functionally correlated, while they are very low and often completely absent in unrelated proteins within the whole superfamily. At such low levels of sequence identity, it is very difficult to distinguish between convergent or divergent mechanisms of evolution (Burkhard, 1997). However, it appears more likely that both mechanisms may explain the IgFF: convergence of unrelated domains towards a simple and stable fold and divergence within each subtype.
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Burkhard,R. (1997) Fold. Des., 2, S19–S24.
Callebaut,I., Labesse,G., Durand,P., Poupon,A., Canard,L., Chomilier,J., Henrissat,B. and Mornon,J.-P. (1997) Cell. Mol. Life Sci., 53, 621–645.
Chan,A.W., Hutchinson,E.G., Harris,D. and Thornton,J.M. (1993) Protein Sci., 2, 1574–1590.
Chothia,C. and Lesk,A.M. (1986) EMBO J., 5, 823–826.
Chou,P.Y. and Fasman,G.D. (1974) Biochemistry, 13, 222–245.
Gelfand,I.M. and Kister,A.E. (1995) Proc. Natl Acad. Sci. USA, 92, 10884–10888.
Godzik,A., Skolnick,J. and Kolinski,A. (1993) Protein Engng, 6, 801–810.
Halaby,D.M. and Mornon,J.P.E. (1998) J. Mol. Evol., 46, 389–400.
Harpaz,Y. and Chothia,C. (1994) J. Mol. Biol., 238, 528–539.
Harris,L. and Bajorath,J. (1995) Protein Sci., 4, 306–310.
Holm,L. and Sanders,C. (1993) J. Mol. Biol., 233, 123–138.
Hubbard,T.J.P. and Blundell,T.L. (1987) Protein Engng, 1, 159–171.
Hunkapiller,T. and Hood,L. (1989) Adv. Immunol., 44, 1–63.
Johnson,M.S., Overington,J.P. and Blundell,T.L. (1993) J. Mol. Biol., 231, 735–752.
Jones,E.Y. (1993) Curr. Opin. Struct. Biol., 3, 846–852.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Lee,B.K. and Richards,F.M. (1971) J. Mol. Biol., 55, 379–400.
Lesk,A.M., Levitt,M. and Chothia,C. (1986) Protein Engng, 1, 77–78.
Maiorov,V.N. and Crippen,G.M. (1994) J. Mol. Biol., 235, 625–634.
Martin,A.C. (1996) Proteins, 25, 130–133.
Martinez,S.E., Huang,D., Szczepaniak,A., Cramer,W.A. and Smith,J.L. (1994) Structure, 2, 95–105.
Mornon,J.-P., Halaby,D., Malfois,M., Durand,P., Callebaut,I. and Tardieu,A. (1997) Int. J. Biol. Macromol., 22, 219–227.
Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biol. Chem., 99, 153–162.
Pfuhl,M., Improta,S., Politou,A.S. and Pastore,A. (1997) J. Mol. Biol., 265, 242–256.
Poupon,A. and Mornon,J.-P. (1998) Proteins, 33, 329–342.
Poupon,A. and Mornon,J.-P. (1999) Theor. Chim. Acta, 101, 2–8.
Proba,K., Honegger,A. and Pluckthun,A. (1997) J. Mol. Biol., 265, 161–172.
Richards,F.M. (1985) The calculation of molecular volumes and areas for structures of known geometry. Acad. Press, Inc.
Richardson,J.S. (1977) Nature, 268, 495–500.
Rudikoff,S. and Pumphrey,J.G. (1986) Proc. Natl Acad. Sci. USA, 83, 7875–7878.
Saitou,N. and Nei,M. (1987) Mol. Biol. Evol., 4, 406–425.
Sali,A. and Blundell,T.L. (1990) J. Mol. Biol., 212, 403–428.
Schneider,R. and Sander,C. (1991) Proteins, 9, 56–68.
Smith,D.K. and Xue,H. (1997) J. Mol. Biol., 274, 530–545.
Sutcliffe,M.J., Haneef,I., Carney,D. and Blundell,T. (1987) Protein Engng, 1, 377–384.
Taylor,W.R. (1986) J. Mol. Biol., 188, 233–258.
Vogt,G. and Argos,P. (1997) Fold. Des., 2, S40–S46.
Williams,A.F. and Barclay,A.N. (1988) Annu. Rev. Immunol., 6, 381–405.
Received August 7, 1998; revised February 10, 1999; accepted March 16, 1999.