Department of Mathematics, Rutgers University, Piscataway, NJ 08854, USA
Abstract |
---|
Keywords: immunoglobulin variable domain/protein sequence classification/residue classification/residue prediction in sequences
Introduction |
---|
In the approach used here, the proposed analysis of sequences is carried out on three levels. At the first level, the frequency of residues at each position is calculated. At the second level, immunoglobulin chains are divided into fragments corresponding to structural units: strands, loops or parts of these secondary structure units. Figure 1 illustrates the fragmentation of sequences using the mouse heavy chain XRPC-44 from the Kabat database as an example (Johnson et al., 1996). The amino acid sequence of a fragment is termed a word. At this level, correlations between residues within each word are discovered. At the third level, a sequence of amino acids is represented as a sequence of words. This allows us to find all the possible combinations of words in a sequence and therefore the conserved combinations of residues in the chains.
At the second level, the focus of the investigation is shifted from individual positions to the analysis of all possible combinations of residues in small sequence fragments, i.e. words. The inspection of residue frequencies showed that the SR and RC positions are occupied by many different residues, but the number of combinations of these residues in a word is very limited. We term the most frequently occurring combinations of residues at given positions in a word the patterns of the words, or keywords.
At the third level, by studying how words combine in sequences, we can uncover correlations among residues throughout the length of the sequence. Representing a chain as a sequence of words rather than as a sequence of amino acids permits us to suggest principles of classification that differ from those usually adopted. Sequences represented in the word notation naturally divide into several classes. Sequences within a class follow the same amino acid pattern, i.e. at most positions, the residues found in the overwhelming majority of sequences of that class are identical or chemically related.
Thus the amino acid patterns of sequences show the correlation of residues for the whole sequence. Because a strong correlation of residues was found, only a few residues are needed to classify a sequence. There exist simple explicit rules for attributing a sequence to a class. A sequence is assigned to a class if its residues at several key positions (e.g. about 5-10 residues in the heavy chains of Ig variable domains) are chemically related to the residues at the respective positions of the pattern sequence of that class.
These investigations result in a 'biological' classification of protein molecules. Central to any biological classification is the notion of 'defining characteristics' (in the case of protein sequences, 'sequence determinants'). Generally, very few characteristics are needed to assign a biological object to its class. In the proposed classification, the sequence determinants of a class are a set of positions, the residues at which fully determine the correlation of residues in a sequence.
The main principles of this approach to sequence classification in protein families have been described previously (Gelfand and Kister, 1995, 1997; Chothia et al., 1998; Galitsky et al., 1998; Gelfand et al., 1996, 1998a, b). In this work, we use this approach to find sequence determinants of the mouse heavy chains of the variable domains. We present here the results of the classification of residue conservation, a collection of patterns for all strands and loops and defining characteristics for the classification of the chains.
We show here that the position-word approach works well for immunoglobulin chains. In the future, we intend to extend our analysis first to the other members of the Ig fold, and then further to proteins that have different types of fold. From these investigations we expect to discover the principal relations between the sequences and structures of proteins for different fold types.
Methods and results |
---|
Two methods of statistical analysis of residue frequencies.
In our research we employed two independent methods of statistical analysis of residue frequencies. According to the first, traditional method, residues at identical (same index) positions are picked out from all available sequences and the amino acid distribution for each position is calculated. However, we found that this method of statistical analysis does not give reliable results, owing to the abundance of identical or nearly identical sequences in the Kabat database.
To overcome this difficulty we suggested an alternative and independent approach to statistical analysis of residue frequencies at a given position (Gelfand and Kister, 1995). First, all words in a set of sequences, which correspond to the same structural unit, are grouped together. Then, from each group of words we selected all the distinct words, those with differing residues in one or more positions. The basic idea behind this method is that residue frequencies are calculated for distinct words only, rather than for all sequences. Using both methods helps one to achieve more reliable statistical conclusions.
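The difference between the two counting schemes can be illustrated with a minimal Python sketch. The function names and the toy words below are hypothetical and are not taken from the Kabat database; the sketch only assumes that all words of one structural unit have equal length.

```python
from collections import Counter

def position_frequencies(words):
    """Return one Counter of residues per position for equal-length words."""
    length = len(words[0])
    if any(len(w) != length for w in words):
        raise ValueError("all words must have the same length")
    return [Counter(w[i] for w in words) for i in range(length)]

def distinct_words(words):
    """Keep a single copy of each word, preserving first-seen order."""
    return list(dict.fromkeys(words))

# Hypothetical toy data: one word is heavily over-represented,
# as happens when a database holds many near-identical sequences.
words = ["GLVA", "GLVA", "GLVA", "GLVA", "GLVK", "ELVK"]

all_freq = position_frequencies(words)                       # first method
distinct_freq = position_frequencies(distinct_words(words))  # second method

print(all_freq[3])       # counts at the last position over all words
print(distinct_freq[3])  # counts at the last position over distinct words only
```

In this toy set the over-represented word dominates the counts of the first method, whereas the second method weights every distinct word equally.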
Classification of residue positions in mouse heavy chains.
Statistical analysis of residues by both methods allows us to determine residue conservation at each position. In this work, amino acid frequencies at positions in the variable domain were calculated for 2721 available sequences of mouse heavy chains from the Kabat database (Johnson et al., 1996). We present here the results of residue frequency calculations for 94 positions of chains in variable domains. Positions in the complementarity determining regions (CDRs) are not considered (Table II). Residues in these hypervariable regions are involved in antibody-antigen binding. In this work, CDR positions are found in the CB, C'C'', C'' and FG words.
Further inspection of residue frequency data revealed 24 positions that are occupied by closely related residues in almost all chains [similar residues (SR) positions] (Table IIB). Seven of them are of SR type with respect to all Ig sequences (Chothia et al., 1998). With regard to residue type, 12 of the 24 SR positions are taken by hydrophobic residues, four by Ser and Thr, three by Gly, Ala and Ser, two by Tyr and Phe, two by positively charged residues (Lys and Arg) and one by negatively charged residues (Glu and Asp).
For the other 53 positions in the variable domains of mouse heavy chains, the chemical character of the residues is very diverse. We present here the positions that are occupied by residues from two amino acid groups only (Table IIC). Twenty-six such positions were identified. Moreover, our analysis showed that Ser, the most common residue in immunoglobulin sequences, can share positions with very different residues, such as Pro, Lys, Arg, Ile, Leu, Asn and Gln. Another very common residue, Gln, shares positions with positively charged Lys and Arg, with negatively charged Glu and Asp, and with Ser and Thr residues. The residues Lys and Arg share positions not only with Gln, but also with such different residues as Val, Gly, Ala, Ser and Thr.
To summarize, the residue frequency data for mouse heavy chains show that, of the 94 positions in the sequences of variable domains (excluding CDR positions), 41 are occupied by an invariant residue or by closely related residues.
Classification of amino acids in immunoglobulin sequences.
Examination of the residue frequencies shows that the amino acids can be divided into groups based on two criteria. Amino acids in one group must be (i) chemically similar and (ii) usually found at the same position in the sequence. For example, it was found that chemically similar Ser and Thr usually share the same positions. Statistical analysis of the immunoglobulin chains revealed four positions which are occupied only by these residues in almost all chains (Chothia et al., 1998). The analysis also revealed positions which are occupied only by hydrophobic residues, only by positively charged residues, etc. This analysis of amino acids in immunoglobulin chains resulted in a 10-group classification (Table I). Amino acid classification is a necessary step on the way to uncovering patterns of words and sequences, as will be explained below.
Analysis of distinct words from each fragment.
The sequences of the mouse heavy chains of variable domains are divided into 21 words (see Figure 1). Then all the words corresponding to a particular secondary structure unit are collected together. For example, D words were found in 1931 chains. All 1931 D words were compared with each other, and a subset of 'distinct words' (words with differing residues in at least one position) was selected from this collection. In total, 159 distinct D words were found.
The number of distinct words varies considerably from one secondary structural unit to another. For example, 295 distinct E words were found among the words describing the E strand. These data illustrate another important point: the number of distinct words is always much smaller than the number of chains containing the fragment (Table III). Exceptions were found only for fragments of CDRs: the numbers of distinct CB, C'C'' and C'' words are relatively large, just three to four times smaller than the number of sequences. The number of distinct FG words is approximately equal to the number of chains, i.e. almost no FG word is found in more than one sequence. The result of this analysis is a collection of distinct words that constitutes a complete database of strands and loops in mouse heavy chains.
Inspection of words that belong to one class revealed several positions where it was found that not one, but two or three residues occur frequently. A question can be raised: if there are two or more such positions in words of one class, is there any correlation among residues at these positions? Consider, for instance, A' words of class 4 in the mouse heavy chains (Table IV). Position A'2 is most often occupied by the hydrophobic residues Ile and Leu and position A'3 by Val and Leu residues. However, not all four of the possible combinations of these residues are found: Leu at A'2 combines with Val at A'3 in one set of words, while in another set of words, Ile at A'2 combines with Leu at position A'3. The presence of such a correlation among residues in words suggests the possibility of dividing a class of words into subclasses, each with its characteristic keyword. Two A' keywords of class 4, 41 and 42, emphasize residue combinations that are commonly encountered; each keyword subsumes a large subclass of words. The subclasses of words with a strong correlation of residues were found for almost all strands and loops, except for A, AA', C'C'', C'' and C''D.
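A simple way to look for such correlations is to count which residue pairs actually co-occur at two positions and to list the combinations that never appear. The Python sketch below is only an illustration: the function names and the four-letter toy words are hypothetical, with index 1 standing in for A'2 and index 2 for A'3.

```python
from collections import Counter

def pair_counts(words, i, j):
    """Count how often each residue pair occurs at positions i and j."""
    return Counter((w[i], w[j]) for w in words)

def missing_combinations(words, i, j):
    """List residue pairs that could be formed from the residues seen
    at positions i and j but that never occur together in any word."""
    seen = pair_counts(words, i, j)
    residues_i = {a for a, _ in seen}
    residues_j = {b for _, b in seen}
    return sorted(
        (a, b) for a in residues_i for b in residues_j if (a, b) not in seen
    )

# Hypothetical words: Leu pairs with Val, Ile pairs with Leu.
words = ["GLVA", "GLVK", "GILK", "EILK"]

print(pair_counts(words, 1, 2))
print(missing_combinations(words, 1, 2))
```

Pairs that are absent although both residues are individually common at their positions are candidates for a correlation of the kind described above, and the observed pairs suggest how a class of words might be split into subclasses.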
Further analysis of words revealed that residues at several positions show no correlation with residues at other positions in a word. Thirteen such positions were found: CB3, CB4, CB5, C2, CC'4, C'5, C'C''1, C'C''2, C'C''3, C'C''4, C''1, C''3 and C''5. Usually residues of three or more amino acid groups occupy these variable positions, which are marked by an X (Table IV).
Analysis of words: a résumé
Third level: classification and patterns of sequences
Determination of the main patterns of words of fragments discussed in the preceding section opens up the way for the classification of sequences in a protein family. A sequence can be broken down into words and each word may be assigned to its proper class. As each class of words is characterized by a keyword, an amino acid chain can be encoded as a sequence of keywords. A particular keyword representation may cover a set of chains of a protein family. Such keyword representation is termed a pattern sequence and the set of chains it describes a class of sequences. According to this definition, residues at same-index positions in sequences of a class belong to the same amino acid group as the residue in the respective position of the pattern sequence.
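The encoding of a chain as a sequence of keywords can be sketched as follows. This is a minimal illustration under stated assumptions: the amino acid groups are a hypothetical stand-in for the 10-group classification of Table I, the keywords and class labels are invented, and `X` marks a variable position that accepts any residue.

```python
# Hypothetical amino acid groups (NOT the groups of Table I).
GROUPS = [set("ILVM"), set("ST"), set("KR"), set("DE"), set("FY"), set("GA")]

def related(a, b):
    """True if two residues are identical or share an assumed group."""
    return a == b or any(a in g and b in g for g in GROUPS)

def matches(word, keyword):
    """A word fits a keyword if every non-X position holds a related residue."""
    return len(word) == len(keyword) and all(
        k == "X" or related(r, k) for r, k in zip(word, keyword)
    )

def encode(words, keywords):
    """Replace each (unit, word) by the label of the first matching keyword,
    or by None when no keyword of that unit matches."""
    encoded = []
    for unit, word in words:
        label = next(
            (name for name, kw in keywords.get(unit, {}).items()
             if matches(word, kw)),
            None,
        )
        encoded.append((unit, label))
    return encoded

# Hypothetical keywords per structural unit; labels are illustrative only.
KEYWORDS = {
    "A'": {"class-a": "GLVA", "class-b": "GILK"},
    "A'B": {"class-a": "PSQ"},
}

chain = [("A'", "GIVA"), ("A'B", "PTQ"), ("A'", "WWWW")]
print(encode(chain, KEYWORDS))
```

Chains whose encodings coincide would fall into the same class of sequences in the sense defined above.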
Examination of the mouse heavy chains written in terms of keywords reveals that the sequences can be divided into eight classes. Classes I and III are further subdivided into two subclasses. Pattern sequences for the classes and subclasses are presented as a set of the keywords of all words (Table V). Comparison of pattern sequences of these classes showed that they differ from each other at 14 to 54 positions.
Discussion |
---|
Now we can address the question posed in the Introduction: is there any interdependence among residues at non-conservative positions? This question was reformulated in terms of positions and words. At first, we calculated the frequency of residues at each position. Then we split this problem into two parts: determination of the correlation between residues (i) within small fragments, i.e. in words, and (ii) far apart in the sequence, i.e. between words.
Studies of sequence fragments, words, led to elucidation of possible patterns, or keywords, of each strand and loop. In fact, the possibility of determining patterns of a secondary structure unit is predicated upon the existence of a large number of local correlations. A small minority of positions, however, are quite variable: they can be occupied by residues from several amino acid groups. There was little, if any, correlation between residues at these positions and residues at other positions in the keyword.
To answer the question about interdependence between distant residues along the chains, we analyzed the correlation among keywords. This analysis revealed amino acid patterns of sequences in classes. The patterns represent a correlation between amino acids throughout the length of the sequence. Consider, for example, the pattern of class II sequences. We found that Ile at position 0A2 correlates with, for instance, Lys at A'3, Glu at A'B3, Phe at position D4, as well as specific residues at every position in the sequence, with the exception of position C2.
Classification of positions in mouse heavy chains
The investigation of residue frequency at each position and analysis of the correlation of residues show the 'organization' of residues in the mouse heavy chains. As was pointed out in the Introduction, calculation of residue frequencies resulted in three kinds of position classification: IR, SR and RC. In order to consider the correlation of residues in sequences, we divided the SR and RC positions into two types: 'class-determining positions' and 'variable positions'. Residues at the class-determining positions correlate with each other. A set of residues at these positions differentiates one class of sequences from another. About 60 such class-determining positions were found. Residues at variable positions show no correlation with other residues in the sequence. Variable positions may contain different residues even in sequences of the same class. Taking into account that residues at most variable positions are involved in antigen binding, it may be suggested that substitution of residues at these positions has no effect on the Ig-type of folding.
Class-determining positions
Clearly, to assign a query sequence to its proper class, we need to compare residues at class-specific positions of the query and pattern sequences. In fact, we need not know the residues at all class-specific positions in the query sequence. An advantage of our approach is that it allows one to find a small number of the class-determining positions that uniquely determine a class of a sequence. For example, comparison of the pattern sequences leads to the following conclusion: information about the residue content of merely five positions at the beginning of a sequence (0A1, 0A2, 0A3, A2 and A3 positions) is sufficient to determine the class-affiliation of a given mouse heavy chain. Indeed, a set of residues at these five positions is unique for each of the eight classes of sequence (Table V).
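The assignment rule based on a few key positions can be sketched in Python. The five position names come from the text; everything else is an assumption made for illustration: the amino acid groups are not those of Table I, and the class labels and residues in `CLASS_PATTERNS` are invented rather than taken from Table V.

```python
# Hypothetical amino acid groups (NOT the groups of Table I).
GROUPS = [set("ILVM"), set("ST"), set("KR"), set("DE"), set("FY"), set("GA")]

def related(a, b):
    """True if two residues are identical or share an assumed group."""
    return a == b or any(a in g and b in g for g in GROUPS)

# The five class-determining positions named in the text.
KEY_POSITIONS = ("0A1", "0A2", "0A3", "A2", "A3")

# Hypothetical pattern residues at those positions (NOT Table V values).
CLASS_PATTERNS = {
    "class-1": dict(zip(KEY_POSITIONS, "EVQLQ")),
    "class-2": dict(zip(KEY_POSITIONS, "QIQLV")),
}

def classify(query):
    """Return every class whose pattern residues are all related to the
    query residues at the key positions; a query lacking a key position
    cannot be assigned by this rule."""
    return [
        name
        for name, pattern in CLASS_PATTERNS.items()
        if all(
            pos in query and related(query[pos], pattern[pos])
            for pos in KEY_POSITIONS
        )
    ]

query = {"0A1": "Q", "0A2": "L", "0A3": "Q", "A2": "L", "A3": "V"}
print(classify(query))
```

If the residue sets at the key positions are unique for each class, as stated for Table V, the returned list contains at most one class.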
Prediction of residues in sequences
An important corollary of this approach is the possibility of predicting residues in an incomplete sequence. Knowledge of residues at several crucial class-determining positions in a query sequence allows us to determine the class to which the sequence is to be assigned and its amino acid pattern. Therefore, knowing the residues at but a few positions, we can predict the amino acid (or, more precisely, the group of close amino acids) at almost all other positions of that sequence. It follows that one can reconstruct the entire sequence from a small sequence fragment containing the class-determining positions.
To check the prediction accuracy, we randomly chose 50 full sequences (sequences without any missing residues) from the Kabat database. About 75% of the residues were randomly eliminated from each sequence. The reconstruction procedure includes the following steps. Let us consider the sequence fragment PGLVAPSQSISITCTVSG. The alignment of this fragment with all keywords of the mouse heavy chains (Table IV) revealed three words: GLVA, PSQ and SISITCTV. These words match the A', A'B and B keywords in the amino acid pattern of class VI (Table V). Following this pattern we can predict almost all residues in a sequence. After this procedure had been successfully tested on the complete sequences, we reconstructed about 300 incomplete sequences from the Kabat database. For all of these sequences the corresponding amino acid patterns were found unambiguously, and missing residues (or groups of most likely residues) were predicted.
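The reconstruction steps can be sketched in Python for the fragment quoted above. This is a simplified illustration, not the procedure actually used: it matches keywords by exact substring search rather than by amino acid groups, and the patterns are toy data. Pattern `toy-1` reuses the three words quoted in the text, while its A keyword and the whole of `toy-2` are hypothetical.

```python
def find_words(fragment, pattern):
    """Return (unit, start) for each keyword of the pattern that occurs
    in the fragment, scanning left to right in pattern order."""
    hits, cursor = [], 0
    for unit, keyword in pattern:
        start = fragment.find(keyword, cursor)
        if start != -1:
            hits.append((unit, start))
            cursor = start + len(keyword)
    return hits

def best_class(fragment, patterns):
    """Pick the class whose pattern has the most keywords in the fragment."""
    return max(patterns, key=lambda name: len(find_words(fragment, patterns[name])))

def predict_missing(fragment, pattern):
    """Return the keywords of units not found in the fragment; these are
    the residues predicted from the class pattern."""
    found = {unit for unit, _ in find_words(fragment, pattern)}
    return {unit: kw for unit, kw in pattern if unit not in found}

# Toy patterns: ordered (unit, keyword) pairs. Hypothetical except for
# GLVA, PSQ and SISITCTV, which are the words quoted in the text.
PATTERNS = {
    "toy-1": [("A", "KESG"), ("A'", "GLVA"), ("A'B", "PSQ"), ("B", "SISITCTV")],
    "toy-2": [("A", "QQSG"), ("A'", "ELVK"), ("A'B", "PGA"), ("B", "SVKISCKA")],
}

fragment = "PGLVAPSQSISITCTVSG"
cls = best_class(fragment, PATTERNS)
print(cls, find_words(fragment, PATTERNS[cls]))
print(predict_missing(fragment, PATTERNS[cls]))
```

A fuller version would compare residues by amino acid group and would report groups of likely residues rather than single keywords.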
Prediction of germline sequences
It is known that the human immunoglobulin VH locus contains 51 VH segments (Chothia et al., 1992). These segments were classified into seven families on the basis of homology: sequences were said to belong to the same family if they were at least 80% homologous at the nucleotide level. In our previous work, we compared the classification of the human heavy chains with the one obtained for human germline VH sequences (Galitsky et al., 1998). VH segments were compared with the classes determined for the human heavy chains. Remarkably, all germline sequences (except one) naturally fell into the same basic six classes as the Kabat database sequences. On the basis of this result, we hypothesize that the presently unknown mouse VH germline repertoire would also fall into the same classes as were obtained from our analysis of the mouse heavy chains. We suppose that most mouse VH segments can be described by the amino acid patterns of the eight classes and subclasses presented in this paper.
Acknowledgments |
---|
Notes |
---|
Dedicated to the memory of Oleg Ptitsyn
References |
---|
Bork,P., Holm,L. and Sander,C. (1994) J. Mol. Biol., 242, 309–320.
Chothia,C., Boswell,D.R. and Lesk,A.M. (1988) EMBO J., 7, 3745–3755.
Chothia,C., Lesk,A.M., Gherardi,E., Tomlinson,I.M., Walter,G., Marks,J.D., Llewelyn,M.B. and Winter,G. (1992) J. Mol. Biol., 227, 799–817.
Chothia,C., Gelfand,I.M. and Kister,A.E. (1998) J. Mol. Biol., 278, 457–479.
Galitsky,B., Gelfand,I.M. and Kister,A.E. (1998) Proc. Natl Acad. Sci. USA, 95, 5193–5198.
Gelfand,I.M. and Kister,A.E. (1995) Proc. Natl Acad. Sci. USA, 92, 10884–10888.
Gelfand,I.M. and Kister,A.E. (1997) Proc. Natl Acad. Sci. USA, 94, 12562–12567.
Gelfand,I.M., Kister,A.E. and Leschiner,D. (1996) Proc. Natl Acad. Sci. USA, 93, 3675–3678.
Gelfand,I.M., Kister,A.E., Kulikowski,C. and Stoyanov,O. (1998a) J. Comput. Biol., 5, 467–477.
Gelfand,I.M., Kister,A.E., Kulikowski,C. and Stoyanov,O. (1998b) Protein Engng, 11, 1015–1025.
Hunkapiller,T. and Hood,L. (1989) Adv. Immunol., 44, 1–63.
Harpaz,Y. and Chothia,C. (1994) J. Mol. Biol., 238, 528–539.
Harris,L. and Bajorath,J. (1995) Protein Sci., 4, 306–310.
Johnson,G., Kabat,E.A. and Wu,T.T. (1996) Kabat database of sequences of proteins of immunological interest. In Herzenberg,L.A., Weir,W.M., Herzenberg,L.A. and Blackwell,C., Weir's Handbook of Experimental Immunology, Immunochemistry and Molecular Immunology, 5th Edn. Blackwell Science, Cambridge, MA, pp. 6.1–6.21.
Lesk,A.M. and Chothia,C. (1982) J. Mol. Biol., 160, 325–342.
Smith,D.K. and Xue,H. (1997) J. Mol. Biol., 274, 530–545.
Taylor,W.R. (1986) J. Mol. Biol., 188, 233–258.
Williams,A.F. and Barclay,A.N. (1988) Annu. Rev. Immunol., 6, 381–405.
Received March 3, 1999; revised June 5, 1999; accepted June 25, 1999.