The immunoglobulin fold family: sequence analysis and 3D structure comparisons

D.M. Halaby, A. Poupon1 and J.-P. Mornon

Systèmes Moléculaires et Biologie Structurale, LMCP, CNRS UMR C7590 Universités Pierre et Marie Curie (P6) et Denis Diderot (P7), Tour 16, Case 115, 4 Place Jussieu, 75252 Paris cedex 05, France. E-mail: poupon@lmcp.jussieu.fr


    Abstract
 
Fifty-two 3D structures of Ig-like domains covering the immunoglobulin fold family (IgFF) were compared and classified according to the conservation of their secondary structures. Members of the IgFF are distantly related proteins or evolutionarily unrelated proteins with a similar fold, the Ig fold. In this paper, a multiple structural alignment of the conserved common core is described and the correlation between corresponding sequences is discussed. While the members of the IgFF exhibit wide heterogeneity in terms of tissue and species distribution or functional implications, the 3D structures of these domains are far more conserved than their sequences. We define topologically equivalent residues in the Ig-like domains, describe the hydrophobic common cores and discuss the presence of additional strands. The disulfide bridges, not necessary for the stability of the Ig fold, may have an effect on the compactness of the domains. Based upon sequence and structure analysis, we propose the introduction of two new subtypes (C3 and C4) to the previous classifications, in addition to a new global structural classification. The very low mean sequence identity between subgroups of the IgFF suggests the occurrence of both divergent and convergent evolutionary processes, explaining the wide diversity of the superfamily. Finally, this review suggests that the hydrophobic residues constituting the common hydrophobic cores are important clues to explain how highly divergent sequences can adopt a similar fold.

Keywords: comparative study/hydrophobic core/immunoglobulin fold/multiple sequence alignment/protein folding


    Introduction
 
In a previous paper, we highlighted the considerable variety of the immunoglobulin fold family (IgFF) (Halaby and Mornon, 1998), which contains all sequences or structures having an Ig-like fold (and not only sequences having detectable similarities with immunoglobulins). In fact, many of the structures compared in this paper have no detectable sequence similarity with each other. Many other authors have explored the sequences (e.g. Jones, 1993; Harris and Bajorath, 1995; Smith and Xue, 1997) or the structures (Taylor, 1986; Bork et al., 1994; Harpaz and Chothia, 1994) of Ig-like domains. Here we focus on the structural features of the immunoglobulin fold, which has been identified in proteins without either apparent sequence identity or functional similarity.

Classical Ig-like domains are composed of 7–10 β-strands, distributed between two sheets with typical topology and connectivity. However, recent structural analyses revealed additional secondary structure elements in this classical scaffold, such as additional strands [SOD, DPA (PapD), RSY] or helices (CTM, HCY) (see Table I for nomenclature). In this paper, we report an all-against-all structural comparison of 52 distinct Ig-like domains (1326 pairs), having less than 55% pairwise sequence identity. The structures considered were selected from the PDB (Table I). Structure-based sequence analysis and comparison of structural features were performed in order to characterize sequence–structure compatibility. Our observations led us to propose a new structural classification within the IgFF.
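
The pair count quoted above follows directly from comparing every domain with every other domain exactly once. The following minimal sketch checks the arithmetic; the domain labels are hypothetical placeholders (the real names are those listed in Table I).

```python
from itertools import combinations

# Hypothetical placeholder labels; the actual domain names are given in Table I.
domains = [f"domain_{k:02d}" for k in range(1, 53)]

# An all-against-all comparison visits each unordered pair once,
# so n domains give n * (n - 1) / 2 pairs.
pairs = list(combinations(domains, 2))
n_pairs = len(pairs)
print(n_pairs)
```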


 
Table I. Ig-like domains with known 3D structures used in the comparative study
 

    Methods
 
The proteins considered in this comparative study possess one or several Ig-like domains (Table I). Secondary structure assignments of the known 3D structures were used to superimpose 52 Ig-like domains found in 34 distinct proteins. Visualization of the structures, distance calculations and observation of the hydrogen bonds were performed using the INSIGHTII 2.3.0 and INSIGHTII 95.0 programs (Biosym, San Diego, CA). A structural phylogenetic tree was built using the program MOLPHY (Saitou and Nei, 1987). Solvent accessible surfaces (SAS) were computed using the algorithm of Lee and Richards (Lee and Richards, 1971; Richards, 1985).

Superimpositions within each group were generated by various automatic programs [COMPOSER (Sutcliffe et al., 1987), DALI (Holm and Sander, 1993) and COMPARER (Sali and Blundell, 1990)] and checked manually. Protein pairs belonging to different groups were superimposed manually using a pseudo-iterative method comparable to that of Hubbard and Blundell (1987). For this type of comparison the use of automatic programs was impossible because of the differences in the orientation of the two sheets and the presence of extra strands in some structures.


    Results and discussion
 
Ig-like domains have similar general shapes, but differ significantly in their sizes, owing to high variability of the loops (Figure 1). While a classical domain contains about 100 residues (Igs), smaller ones (74–90 residues) have been observed in bacterial Ig-like proteins and in several Ig-related molecules (CD2, CD4). Large decorations within loops, sometimes including extra domains, are found in hemocyanin (238 amino acids), transcription factor NF-κB (201 amino acids) and cytochrome f (214 amino acids).



 
Fig. 1. Exhaustive comparison of 52 domains of the IgFF. The minimal common structural core, formed by about 25% of the domain residues, corresponds to the well superimposed median region. The Ig-like domains have a similar 3D shape, but show wide variability in the length of the loop regions (red). Sheet I is in green, sheet II in blue. The fourth strand of constant domains belongs to sheet I (black) or to sheet II (yellow).

 
Topohydrophobic positions were first studied on a bank of fold families, in which every family contains only homologous proteins of known 3D structure with pairwise identity lower than 55% (Poupon and Mornon, 1998). From the PDB, 445 families were constituted, 153 of which contain two or more structures. Only one of these families contains more than 16 members: the immunoglobulin superfamily. Consequently, the study of this family appears essential to a better understanding of the relationships between topohydrophobic positions and the size and diversity of a protein family, and it also brings new information on the IgFF.

3D superimposition and sequence analysis

Structures were compared by finding the optimal superimposition between the pair considered while avoiding a unique reference structure with which all domains would be compared. The definition of a 'mean structure' for the IgFF does not make sense because of the great diversity in structure and function. Such problems in structure superimposition have been widely studied (Lesk et al., 1986; Godzik et al., 1993; Johnson et al., 1993; Maiorov and Crippen, 1994) and cannot be solved by the existing automatic methods. Variations in the lengths of regular secondary structures (strands) and variable loop regions make the alignment of the whole domains impossible (Figure 1). A common core was defined among the 35 structurally equivalent residues, localized in the six strands common to the 52 structures (strands A, B, C, E, F and G). The fourth strand, numbered according to its appearance in the sequence, cannot be aligned for all the domains studied owing to its variable localizations in sheet I (strand D) or in sheet II (strand C'). The variable domains contain both strands C' and D (Figure 2) and sometimes a ninth strand C''.



 
Fig. 2. Multiple structural alignment of the Ig-like sequences. This alignment is based on structural conservation of β-strands. Most pairwise identities are lower than 25%. Numbers represent the sequences of regular secondary structures. Strand limits are not those found in the 3D structures, but those common to all domains. The domains can be classified into distinct subtypes on the basis of the similarity of their hydrophobic cores. Identical residues are found in positions B3, C5, F3 (black background, in C1, I, V and C2 subtypes), in B5, C1, C5 (Fn3 subtype, red background) and in F1 (black background, in C1, I, V, Fn3). Topohydrophobic positions are indicated for each group in darker color. No sequence signature can be observed for the whole family. However, positions B1 and F1 are occupied by aliphatic and aromatic residues, respectively. Residues A3, B1, B3, C3, E5 and F5 are mostly hydrophobic (see Figure 6A). Fragments of sequences in parentheses are those of H domains, where the fourth strand is located between the two sheets.

 
The superimposition of the structural core leads to root mean square deviations (r.m.s.d.) between equivalent Cα atoms of <3.9 Å. The highest values are observed between highly distant domains (such as NCD and SOD or CD4 and DPA) or between domains presenting local divergence of strand conformation, especially in the external strand A'. Figure 3 illustrates the relation between the r.m.s.d. and the sequence identity for each domain pair. Two interesting regions are circled: the first one (I) corresponds to high sequence identity and, as expected, low r.m.s.d. (pairs of Ig domains, ACX-like domains), while the second one (II) corresponds to low sequence identity and low r.m.s.d., illustrating once more the fact that structural similarity is not necessarily related to sequence similarity.
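
As an illustration of the quantity reported here, the r.m.s.d. over a set of equivalent Cα positions can be sketched as follows. This is a minimal sketch that assumes the two coordinate sets have already been superimposed and listed in the same order of equivalent positions; the function name `rmsd` and the example coordinates are hypothetical and are not taken from the structures studied.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Invented coordinates (in Å): three positions, each displaced by 1 Å along y.
core_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
core_b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(rmsd(core_a, core_b))
```

Finding the optimal superimposition itself (the fitting step) is a separate problem that this sketch does not address.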



 
Fig. 3. Relation between divergence of sequence (% identity) and conservation of structure (r.m.s.d. values) in the 52 compared Ig-like domains (1326 pairs). The interesting regions are circled. Zone I corresponds to high identity and low r.m.s.d. (structural and sequence similarities, domains of C1 and C4 subtypes). Zone II corresponds to low identity and low r.m.s.d. (structural but no sequence similarities).

 
The sequences reveal high divergence in the whole set, as shown in Figure 4. As no chemically conserved residues were detected, it is difficult to propose a consensus sequence from the alignment. In other words, no sequence signature of the Ig fold can be defined. Surprisingly, no sequence identity is observed in the common core for about 2% of the pairs of domains compared. However, sequences can be divided into three subtypes, two with a sequence signature and one that allows residue substitutions but possesses a conserved hydrophobic core.
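
For illustration, the pairwise identity over structurally aligned positions can be sketched as follows. This is a minimal sketch under stated assumptions: the two sequences are given as equal-length aligned strings, gaps are written as `-`, and identity is normalized by the number of positions where both sequences carry a residue. The normalization actually used in the study is not stated in the text, and the example sequences are invented.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of identical residues over positions aligned in both sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip positions where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Invented ten-residue fragments: five of the ten aligned positions are identical.
print(percent_identity("VKLTCSAWYQ", "VRLSCAAWFE"))
```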



 
Fig. 4. Sequence identity between the 1326 pairwise compared domains. Sequence identities are frequently low (<25%). Most of the domains present only 5–10% sequence identity.

 
The first sequence signature is observed for the Ig constant and variable domains (Figure 2). The second concerns the fibronectin type III domains (Fn3). Conserved residues are found in each internal strand, except strand E. This strand shows the best structural fit among the β-strands, but no significant sequence identity can be detected. In several positions, only substitutions conserving hydrophobicity or aliphatic/aromatic amino acid balance are allowed. The most striking conservations of amino acid type are those found in positions A3, B1, C3, E5 and F5 (78, 78, 86, 88 and 67% of VILF residues, respectively). Table II summarizes the different hydrophobic cores found in the IgFF.


 
Table II. Structural classification of constant domains of the IgFF: structure and sequence characteristics of each group
 
Some of the positions of a particular fold are always (in all the proteins adopting this fold) occupied by hydrophobic amino acids. These positions were shown to be key markers of the fold (Poupon and Mornon, 1998, 1999). It has also been demonstrated that the properties of these conserved hydrophobic positions can be extended to all the positions occupied by strong hydrophobic amino acids (VILFMYW) in more than 75% of the representatives of the fold and by amino acids that are not strong loop formers (ACTQERK) in the remaining representatives; these positions are called topohydrophobic positions. In the case of the IgFF domains, only position C3 is topohydrophobic for the complete superfamily. A3, B1 and E5 are occupied by strong hydrophobic amino acids in more than 75% of the sequences but are sometimes occupied by amino acids having strong propensities for loops (Callebaut et al., 1997), so they cannot be counted as topohydrophobic positions as defined. F5 is not topohydrophobic because this position is occupied by strong hydrophobic amino acids in only 67% of the sequences. This result illustrates the great diversity of this superfamily.
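
The definition above can be expressed as a simple test on one alignment column. The following is a minimal sketch, not the authors' implementation: the function name `is_topohydrophobic` and the example columns are hypothetical, and the column is assumed to contain one residue per representative with no gaps.

```python
STRONG_HYDROPHOBIC = set("VILFMYW")  # strong hydrophobic amino acids, as listed in the text
PERMITTED_OTHERS = set("ACTQERK")    # residues tolerated in the remaining representatives


def is_topohydrophobic(column, threshold=0.75):
    """Return True if an alignment column satisfies the definition given above.

    More than `threshold` of the residues must be strong hydrophobic, and every
    other residue must belong to the permitted set.
    """
    residues = [r.upper() for r in column]
    if not residues:
        return False
    n_strong = sum(r in STRONG_HYDROPHOBIC for r in residues)
    if n_strong / len(residues) <= threshold:
        return False
    return all(r in STRONG_HYDROPHOBIC or r in PERMITTED_OTHERS for r in residues)


# Invented columns: the first passes; the second fails because of one glycine.
print(is_topohydrophobic("VVVILFMA"))
print(is_topohydrophobic("VVVILFMG"))
```

Under this reading, a position such as F5 (67% strong hydrophobic residues) fails the first condition, while positions such as A3, B1 and E5 pass it but fail the second.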

In order to investigate the similarity between the different groups defined in the IgFF (Figure 2), the topohydrophobic positions in each of them were determined (Table II). A score Si,j, which depends on the number of topohydrophobic positions in each group (ni for group i, nj for group j) and on the number of topohydrophobic positions common to the two groups (Ni,j), was defined:


and computed for all possible pairs of groups (Table III). In each group the number of topohydrophobic positions is close to what is expected for a homogeneous family (7–10% of the total amino acids), except for the groups 'Others' (which was already known to be heterogeneous) and V.


 
Table III. Topohydrophobic scores
 
Hydrogen bonds

Hydrogen bonds in the Ig-like domains are mainly conserved in the structural common core (Figure 5) (Kabsch and Sander, 1983). However, some domains deviate from this general scheme, in that not all possible H-bonds are formed or, alternatively, H-bonds are established between non-equivalent residues. In many domains, the external strands A and G present geometrical distortions known as β-bulges (Richardson, 1977; Chan et al., 1993), which lead to an imperfect general H-bond network. Hydrogen bonds have been extensively shown to be important for the stability and dynamics of protein domains (Pfuhl et al., 1997; Vogt and Argos, 1997) and probably play an important role in Ig domains.



 
Fig. 5. General hydrogen bond diagram observed in the Ig-like domains. The hydrogen bonds between strands A and G are not shown, because they are less conserved between domains. Numbers represent residue positions, as shown in Figure 2.

 
Burial of conserved hydrophobic residues

Based on the structural alignment, a typical hydrophobic core of the Ig fold can be described. Solvent accessibility surfaces (SAS) were calculated for each residue (Figure 6A). These calculations show that the structural core can be virtually divided into three parts: buried positions (SAS < 20 Å²), internal positions (20 Å² < SAS < 50 Å²) and exposed positions (SAS > 50 Å²). The most buried positions are B3, C3 and F3 (SAS < 10 Å²). For strands B, C, E and F, amino acids with side chains pointing towards the interior of the protein have SAS < 20 Å² and amino acids with side chains pointing towards the exterior of the protein have SAS between 20 and 50 Å². For the other strands (A, C', D and G), the SAS of residues oriented towards the interior (between 20 and 50 Å²) is still lower than that of residues oriented towards the exterior (>50 Å²), but the values are higher.
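
The three-way division described above amounts to thresholding the mean SAS of each position. A minimal sketch follows; the function name `burial_class` is hypothetical, and assigning the boundary values of exactly 20 and 50 Å² to the internal class is an assumption, since the text uses strict inequalities on both sides.

```python
def burial_class(sas):
    """Classify a position from its mean solvent accessibility surface (in Å²)."""
    if sas < 0:
        raise ValueError("SAS cannot be negative")
    if sas < 20.0:
        return "buried"
    if sas <= 50.0:
        # Assumption: exactly 20 or 50 Å² is treated as internal.
        return "internal"
    return "exposed"


# Invented values for illustration only.
for value in (8.0, 35.0, 72.0):
    print(value, burial_class(value))
```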



 
Fig. 6. (A) Average of solvent accessibility surfaces (SAS). The mean solvent accessibility surface for each position in each strand was computed using the algorithm of Lee and Richards (Lee and Richards, 1971; Richardson, 1977). (B) The fourth strand of constant domains. Sheet I is in green, sheet II in blue. The classification of the Ig-like domains partly depends on the occurrence of the fourth strand in sheet I or II: C1 domains (black, in sheet I), C2, C3, C4, S and Fn3 domains (yellow, in sheet II). The subtype H constitutes a hybrid form between these subtypes, with the fourth strand located between the two sheets (red).

 
These observations lead to the definition of internal (B, C, E and F) and external (A, C', D and G) strands. It is interesting that C3 is a topohydrophobic position, while B3 and F3 are topohydrophobic positions in three groups (C3, C4 and Fn3) and form disulfide bridges in two others (C1 and C2).

The hydrophobic core described here is a key signature with an impact on the structural behavior of the Ig fold.

The disulfide bonds

A dominant feature of canonical Ig domains is the disulfide bridge connecting strands B and F. However, in several members of the family, disulfide bonds, when they exist, involve two strands, a strand and a loop or two loops. The disulfide bridge of Ig molecules is invariably located between positions B3 and F3. Only ~0.5% of the sequences of the KABAT bank (Martin, 1996) lack these cysteine residues, with consequent loss of their biological activities (Proba et al., 1997) in the same manner as Ig mutants with substitutions at the cysteine residues. However, a functional antibody lacking the disulfide bridge has been observed (Rudikoff and Pumphrey, 1986).

The usual method for identifying Ig-like domains consists in checking the length of the fragment between the two cysteine residues. Indeed, the number of residues within the disulfide bridge varies significantly as the domain type changes: 63–76 (V), 54–64 (C1), 38–43 (C2) and 44–51 (I) (Smith and Xue, 1997).
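
The length check described above can be sketched as a range lookup. This is a minimal sketch using only the ranges quoted in the text; the function name `candidate_subtypes` is hypothetical, and the exact counting convention (whether the two cysteines themselves are included) is not specified here, so the input is assumed to be counted in the same way as the published ranges.

```python
# Residues between the two bridged cysteines, as quoted above (Smith and Xue, 1997).
CYS_SPACING = {
    "V": (63, 76),
    "C1": (54, 64),
    "C2": (38, 43),
    "I": (44, 51),
}


def candidate_subtypes(n_residues):
    """Return the subtypes whose quoted spacing range contains n_residues."""
    return sorted(
        name for name, (low, high) in CYS_SPACING.items() if low <= n_residues <= high
    )


print(candidate_subtypes(40))  # falls in the C2 range only
print(candidate_subtypes(64))  # the C1 and V ranges overlap at 63-64
```

Because the V and C1 ranges overlap and some lengths fall in no range, the lookup returns a list of candidates rather than a single answer.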

However, the disulfide bridge is not a common feature of the IgFF. Although it is expected to be important for the correct and stable folding of an Ig domain, the B3–F3 disulfide bond is absent from many Ig-like domains, which raises the question of its actual involvement in the folding pathway. Furthermore, some proteins, such as TLK, NFK1, NFK2, BGL, CTM and GGT3, exhibit free cysteine residues. The impact of the classical disulfide bridge occurring within Ig-like domains can be evaluated as follows. (a) The absence of the disulfide bridge or the occurrence of a bridge in an atypical position explains the larger separation of the two sheets in comparison with classical domains, as reflected by the distance between the Cα atoms of positions B3 and F3 (6.1–7.1 and 7.2–11.7 Å for domains with and without the canonical disulfide bridge, respectively) (Table II). The role of this disulfide bridge in the compactness of the domain has been confirmed by mutagenesis of an Ig-variable domain, which also demonstrates the ability of the protein to accommodate the absence of the cysteine residues and maintain its fold (Proba et al., 1997). (b) In domains lacking the disulfide bridge, the cysteine residues are replaced by strong hydrophobic residues, the side chains of which maintain the hydrophobic core formation.
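
The B3–F3 separation discussed in point (a) is a plain Euclidean distance between two Cα positions. The sketch below computes it and compares the result with the two ranges quoted above; the function names and the coordinates are hypothetical, and the ranges describe the 52 domains studied here rather than a general rule.

```python
import math

# Ranges of the B3-F3 Cα distance quoted in the text (in Å).
WITH_BRIDGE = (6.1, 7.1)
WITHOUT_BRIDGE = (7.2, 11.7)


def ca_distance(ca_b3, ca_f3):
    """Euclidean distance between two Cα positions given as (x, y, z) tuples."""
    return math.dist(ca_b3, ca_f3)


def observed_range(distance):
    """Say which of the two quoted ranges, if any, contains the distance."""
    if WITH_BRIDGE[0] <= distance <= WITH_BRIDGE[1]:
        return "range observed with the canonical bridge"
    if WITHOUT_BRIDGE[0] <= distance <= WITHOUT_BRIDGE[1]:
        return "range observed without the canonical bridge"
    return "outside both quoted ranges"


# Invented coordinates for illustration only.
d = ca_distance((0.0, 0.0, 0.0), (0.0, 9.0, 0.0))
print(d, observed_range(d))
```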

In conclusion, the disulfide bond may have more of a functional than a structural role. The absence of this structural constraint in many domains may allow adaptation to specific biological functions or to particular structural features, such as the insertion of additional secondary structure in the domain, and may enhance the assembly of many Ig-like domains such as those of the fibronectin type.

Structural classification of the IgFF

Williams and Barclay (1988) divided classical Ig domains into three topological domain subtypes: C1 (constant 1), C2 (constant 2) and V (variable). The resolution of many structures of Ig-like domains has revealed new topological subtypes, including the I (intermediate) (Harpaz and Chothia, 1994), S (switched) and H (hybrid) types (Bork et al., 1994). The present analysis is in agreement with these studies and extends the comparison to Ig-like domains possessing additional strands, such as the structures of SOD (superoxide dismutase), hemocyanin, DPA (PapD) domain 2 and cytochrome f. The only criterion required is the occurrence in the domain of a topology and connectivity similar to those of immunoglobulins (Halaby and Mornon, 1998). Domains that are distant in terms of angles between sheets, twists in some strands or difficult superimposition are also included in our study. The extension of the previous structural classifications to the newly identified structures, combined with a sequence analysis of the Ig-like domains, led us to define two new subtypes: C3 (constant 3) and C4 (constant 4) (Table II). The discrimination of these two groups is justified by differences in sequence: Fn3 and C4 have different hydrophobic cores, the only common feature between them being the presence of a tyrosine in position C1; Fn3 proteins have a tryptophan residue in position B5, an aromatic residue in position C5 and a tyrosine residue in position F1, none of which is found in proteins of the C4 sub-family. It is also justified by structural characteristics: proteins of the C4 sub-family have two conserved disulfide bonds, neither of which is found in proteins of the Fn3 sub-family, and they have two β-strands forming a small sheet perpendicular to the two canonical ones, implicated in the active site (Table II). This discrimination can also be justified by the values obtained for the Si,j parameters: the highest value is obtained when comparing the sub-families I and C1, whose separation is widely accepted, so it seems reasonable to split Fn3–C4 and Fn3–C3, as in both cases the value obtained is much lower.

The information contained in the structural distance matrix (r.m.s.d. values) is illustrated through hierarchical clustering [using the program MOLPHY (Saitou and Nei, 1987)] as shown in Figure 7. The distance between the 52 proteins studied, measured by the r.m.s.d. values, is coherent with the classification in subgroups. However, the tree reported in this study was established on the basis of structural similarity and should not be directly compared with trees constructed on the basis of sequence comparison. Cross-comparison of the 52 Ig-like domains reveals a coherent clustering into subclasses, which together with the sequence analysis results in a new classification of Ig-like domains.



 
Fig. 7. Structural tree of the IgFF. Multiple cross-comparisons of the Ig-like domains led to a coherent clustering of the domains into subclasses. Comparison of this classification, based on structural criteria (r.m.s.d. values), with those derived from the sequence analysis led to a new classification of the IgFF as indicated on the right. At the bottom of the tree, the domains NCI and NFK1 form a separate cluster, owing to their particular characteristics (see text). Most proteins of the same cluster have similar functions (C1, C2, V) or unknown functions (C3).

 
An Ig-like domain invariably contains six strands, A, B, C, E, F and G, which constitute the common structural core. Buried amino acids of strands B, C, E and F constitute the common hydrophobic core. Strands A, C', C'', D and G are the external strands. The presence or absence of these strands in a domain, except for strand G, determines its appearance in the C, V, I, S or H sets. The greatest variability occurs in the fourth strand, numbered as it appears in the sequence. This strand belongs to the first sheet (strand D, domain C1) or to the second sheet (strand C', domain S: domains C2, C3, C4). Variable domains contain both strands D and C' (Figure 6B). The H domains are hybrid forms between the C and S types, the fourth strand lying between the two sheets. Type I corresponds to domains presenting sequence signatures of the C1 domains (in positions B3, C5, F1 and F3) and structural features of variable domains (number and topology of strands). Table IV summarizes the different subtypes described here and their topologies.


 
Table IV. Topology of IgFF subclasses
 
As the number of distinct subclasses in the IgFF increases, many questions arise, such as how similar the subtypes are or which subtype could be the first domain from which the different subclasses may have evolved. Structural and sequence considerations lead us to cluster the different subclasses into similar groups [(((((C1, V) C2) I) Fn3) C4) C3]. The pairs of compared subclasses are clustered so as to maximize the sequence identity and to minimize the r.m.s.d. values. From left to right, the sequence identity between pairs of subclasses decreases and the r.m.s.d. values increase. The score used for the determination of the above classification is defined as follows: for the subgroups G1 and G2, the score S(G1,G2) is


where n1 and n2 are the number of members in subgroups G1 and G2, respectively, with


where r.m.s.(P1,P2) and id(P1,P2) are the root mean square deviation and the sequence identity, respectively, between the two proteins P1 and P2 belonging to groups G1 and G2.

Different hypotheses have been made, mainly to explain how the primordial domain might have gained or lost a strand, leading to intermediate structures. Depending on the authors, the original domain might be the V domain (Williams and Barclay, 1988) or the C2 domain (Hunkapiller and Hood, 1989; Smith and Xue, 1997).

Since several Ig-like domains did not cluster with any of the structural sets described above (NCD, CTM, NFK2), additional subclasses of the Ig fold must exist and should become better documented as new 3D structures are solved. The NCD differs from a V domain by the localization of strand A between the two sheets and the absence of hydrogen bonds between strands A and B. The CTM domain presents nine strands, like a variable domain, but the connectivity between the C and E strands is atypical: the topology of the CTM domain is AA'BCDC'EFG (instead of AA'BCC'DEFG for a variable domain). The second domain of NFK could be described as intermediate between variable domains (same number of strands) and constant domains (a maximum of 14% sequence identity with C4 domains and with bacterial chitinase within the whole superfamily).
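
The difference between the CTM and variable-domain topologies quoted above can be made explicit by splitting each topology string into strand labels and comparing them position by position. This is a minimal sketch; the helper name `strands` is hypothetical, and it assumes labels consist of one letter from A to G followed by zero or more primes.

```python
import re


def strands(topology):
    """Split a topology string such as AA'BCC'DEFG into a list of strand labels."""
    return re.findall(r"[A-G]'*", topology)


ctm = strands("AA'BCDC'EFG")       # CTM domain, as quoted in the text
variable = strands("AA'BCC'DEFG")  # variable domain, as quoted in the text

# Same strands in both topologies, but in a different order along the sequence.
same_strands = sorted(ctm) == sorted(variable)
swapped = [i for i, (a, b) in enumerate(zip(ctm, variable)) if a != b]
print(ctm)
print(variable)
print(same_strands, swapped)
```

The comparison shows that both topologies contain the same nine strands and differ only in the order of the two strands lying between C and E.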

Conclusion

In a previous paper, we showed that the immunoglobulin fold family (IgFF) comprises a heterogeneous group of proteins sharing structural similarity but exhibiting a wide range of functions, species and tissue distribution. In this paper, 52 Ig-like domains found in the PDB were compared in order to define and characterize sequence and structural constraints of the Ig fold. The structure-based multiple alignment of the sequences revealed low overall sequence identity (often in the 5–15% range) and no functional relationship. Geometrical features, such as secondary structure, hydrogen bonds, disulfide bridges and solvent exposure, were compared through 1326 pairs of Ig-like domains.

Within the compared Ig-like domains, a few residues form the common core. As a general rule, two sequences which share at least 30% sequence identity are considered to fold very similarly (Chothia and Lesk, 1986; Schneider and Sander, 1991). The IgFF is remarkable in that most of the Ig-like domains display <10% sequence identity. Many studies have shown that the folding pattern of a protein is dependent not only on its sequence, but also implicitly on its overall amino acid composition (Nakashima et al., 1986; Chou, 1989) and that the size of the protein and the percentage of each amino acid can be used to predict the folding type. In the IgFF domains, most of the residues constituting the common core are, as expected, hydrophobic and are concentrated in a small number of conserved positions, probably responsible for maintenance of the Ig fold. Membership in this continually growing structural family requires specific interactions that stabilize the folded domains: (a) the formation of a typical hydrophobic core coded by the sequence; (b) the occurrence of specific tertiary interactions within the hydrophobic core; (c) in several subtypes, the introduction of disulfide bridges which influence the overall domain shape and also the symmetry between the two sheets.

Although these proteins retain a common fold, structural changes occur as their sequences diverge. Residue substitutions do not change the overall appearance of the β-strands. However, changes in H-bond spacing, in the twists of strands or in the orientation of one sheet relative to the other are observed to accommodate the sequence variation. Here we emphasize for a large sample that Ig-like domains have more structural (r.m.s.d. between Cα atoms always <3.9 Å) than sequence similarities (identity mainly <25%). The hydrophobic core probably has a major impact on the uniqueness and stability of the Ig fold. As a general rule, mutations are not disruptive, as we observe a conservation of the properties of amino acids (hydrophobic/hydrophilic) along the alignment. A 29-residue structural core is common to all of the 52 considered domains, defined by the strands B, C, E and F and by six additional residues belonging to strands C' or D. The external strands A and G are more difficult to align owing to irregularities and distortions in several domains. The β-bulges occurring in strand A in some domains lead to the appearance of an additional strand A', such as in Ig-variable domains and many domains distantly related to Ig molecules.

Despite the wide sequence variations in Ig-like domains, the maintenance of the Ig fold seems to be enhanced by a conserved geometry of hydrogen bonds. In addition to sequence analysis of the Ig-like domains, the quantitative evaluation of their structural similarity appears to be important to build models for other members of the IgFF, to elucidate Ig folding principles and to predict new members through sensitive sequence comparisons (e.g. Mornon et al., 1997).

The Ig-like domains have been identified in various kingdoms including eukaryotes and prokaryotes, bacteria, viruses, fungi and plants [see Halaby and Mornon (1998) for a review]. Some of these domains lack known biological activities, such as those present in bacterial enzymes. The widespread occurrence of the Ig fold and its appearance in plants (Martinez et al., 1994) precludes any species or function exclusivity, i.e. the immune response, and raises the question of the origins of the fold. Is the Ig fold derived from a common ancestor, where in some cases the functional activities have been lost during evolution, or is it a stable structure to which many sequences have converged?

Members of the immunoglobulin family are known to be phylogenetically related and Gelfand and Kister (1995) showed that there are 47 similar positions in the Ig sequences of the Kabat bank, eight being strictly conserved. Such identity cannot be extended to the IgFF, as illustrated by the fact that no strict topohydrophobic positions can be identified for the whole family. Indeed, the study of topohydrophobic positions in the previously defined groups clearly demonstrated the homogeneity within the groups and the heterogeneity between them. Interestingly, the scores computed for each pair of groups in the IgFF and the phylogenetic tree calculated on the basis of sequence identity and r.m.s.d. values correlate well: the pairs of groups which are close to each other in the phylogenetic tree have high scores and those which are distant in the tree correspond to low scores. This result confirms that topohydrophobic positions are indeed related to structural and sequence features.

Because the determination of topohydrophobic positions is a very recent technique, it is difficult to assess it quantitatively. However, the values for Si,j obtained in the present study agree closely with the structural data: values obtained for two subsets of structures belonging to the same sub-family are always higher than 1.5 (data not shown) and consequently always higher than the values obtained by comparing two different sub-families.
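The separation criterion stated above can be written as a simple check. This is a sketch under stated assumptions: the scores are supplied as a mapping from pairs of subset labels to values, and every label, function name and number below is hypothetical and chosen for illustration only. The way the scores themselves are computed is not reproduced here.

```python
def within_between_separation(scores, family_of):
    """Return (lowest within-sub-family score, highest between-sub-family score)."""
    within, between = [], []
    for (a, b), value in scores.items():
        if family_of[a] == family_of[b]:
            within.append(value)
        else:
            between.append(value)
    if not within or not between:
        raise ValueError("need at least one within and one between pair")
    return min(within), max(between)


# Invented toy values (not data from the study).
family_of = {"V_a": "V", "V_b": "V", "C1_a": "C1", "C1_b": "C1"}
scores = {
    ("V_a", "V_b"): 1.9,    # same sub-family
    ("C1_a", "C1_b"): 1.7,  # same sub-family
    ("V_a", "C1_a"): 0.8,   # different sub-families
    ("V_b", "C1_b"): 0.6,   # different sub-families
}

lowest_within, highest_between = within_between_separation(scores, family_of)
# The criterion holds when every within value exceeds 1.5 and every between value.
criterion_holds = lowest_within > 1.5 and lowest_within > highest_between
print(criterion_holds)
```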

The present study cannot definitively answer the difficult question of whether the IgFF evolved by divergent processes, convergent processes or both. Indeed, structural and sequence conservation are high between subfamilies that are functionally correlated, while they are very low and often completely absent in unrelated proteins within the whole superfamily. At such low levels of sequence identity, it is very difficult to distinguish between convergent and divergent mechanisms of evolution (Burkhard, 1997). However, it appears most likely that both mechanisms contributed to the IgFF: convergence of unrelated domains towards a simple and stable fold and divergence within each subtype.


    Notes
 
1 To whom correspondence should be addressed


    References
 
Bork,P., Holm,L. and Sander,C. (1994) J. Mol. Biol., 242, 309–320.

Burkhard,R. (1997) Fold. Des., 2, S19–S24.

Callebaut,I., Labesse,G., Durand,P., Poupon,A., Canard,L., Chomilier,J., Henrissat,B. and Mornon,J.-P. (1997) Cell. Mol. Life Sci., 53, 621–645.

Chan,A.W., Hutchinson,E.G., Harris,D. and Thornton,J.M. (1993) Protein Sci., 2, 1574–1590.

Chothia,C. and Lesk,A.M. (1986) EMBO J., 5, 823–826.

Chou,P.Y. and Fasman,G.D. (1974) Biochemistry, 13, 222–245.

Gelfand,I.M. and Kister,A.E. (1995) Proc. Natl Acad. Sci. USA, 92, 10884–10888.

Godzik,A., Skolnick,J. and Kolinski,A. (1993) Protein Engng, 6, 801–810.

Halaby,D.M. and Mornon,J.P.E. (1998) J. Mol. Evol., 46, 389–400.

Harpaz,Y. and Chothia,C. (1994) J. Mol. Biol., 238, 528–539.

Harris,L. and Bajorath,J. (1995) Protein Sci., 4, 306–310.

Holm,L. and Sander,C. (1993) J. Mol. Biol., 233, 123–138.

Hubbard,T.J.P. and Blundell,T.L. (1987) Protein Engng, 1, 159–171.

Hunkapiller,T. and Hood,L. (1989) Adv. Immunol., 44, 1–63.

Johnson,M.S., Overington,J.P. and Blundell,T.L. (1993) J. Mol. Biol., 231, 735–752.

Jones,E.Y. (1993) Curr. Opin. Struct. Biol., 3, 846–852.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Lee,B.K. and Richards,F.M. (1971) J. Mol. Biol., 55, 379–400.

Lesk,A.M., Levitt,M. and Chothia,C. (1986) Protein Engng, 1, 77–78.

Maiorov,V.N. and Crippen,G.M. (1994) J. Mol. Biol., 235, 625–634.

Martin,A.C. (1996) Proteins, 25, 130–133.

Martinez,S.E., Huang,D., Szczepaniak,A., Cramer,W.A. and Smith,J.L. (1994) Structure, 2, 95–105.

Mornon,J.-P., Halaby,D., Malfois,M., Durand,P., Callebaut,I. and Tardieu,A. (1997) Int. J. Biol. Macromol., 22, 219–227.

Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biochem., 99, 153–162.

Pfuhl,M., Improta,S., Politou,A.S. and Pastore,A. (1997) J. Mol. Biol., 265, 242–256.

Poupon,A. and Mornon,J.-P. (1998) Proteins, 33, 329–342.

Poupon,A. and Mornon,J.-P. (1999) Theor. Chim. Acta, 101, 2–8.

Proba,K., Honegger,A. and Pluckthun,A. (1997) J. Mol. Biol., 265, 161–172.

Richards,F.M. (1985) The calculation of molecular volumes and areas for structures of known geometry. Acad. Press, Inc.

Richardson,J.S. (1977) Nature, 268, 495–500.

Rudikoff,S. and Pumphrey,J.G. (1986) Proc. Natl Acad. Sci. USA, 83, 7875–7878.

Saitou,N. and Nei,M. (1987) Mol. Biol. Evol., 4, 406–425.

Sali,A. and Blundell,T.L. (1990) J. Mol. Biol., 212, 403–428.

Schneider,R. and Sander,C. (1991) Proteins, 9, 56–68.

Smith,D.K. and Xue,H. (1997) J. Mol. Biol., 274, 530–545.

Sutcliffe,M.J., Haneef,I., Carney,D. and Blundell,T. (1987) Protein Engng, 1, 377–384.

Taylor,W.R. (1986) J. Mol. Biol., 188, 233–258.

Vogt,G. and Argos,P. (1997) Fold. Des., 2, S40–S46.

Williams,A.F. and Barclay,A.N. (1988) Annu. Rev. Immunol., 6, 381–405.

Received August 7, 1998; revised February 10, 1999; accepted March 16, 1999.