Department of Chemistry, Rutgers University, Wright-Rieman Laboratories, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
Keywords: minimal alphabet/protein fold recognition/sequence alignment
Introduction
Taken together, these experimental results suggest that there may exist reduced amino acid alphabets which could be used to fold many proteins by making appropriate substitutions in the original sequences. It is difficult to test this hypothesis directly on a sufficiently large number of proteins representative of the known folding families, but we can obtain insights into this problem by studying sequence patterns which characterize these protein folding families, and the effects of alphabet reduction on these sequence patterns.
A central problem of structural genomics is to predict the folding family given a newly sequenced gene. In many instances this can be accomplished by aligning the 'query' sequence against a database of sequences which have been clustered into folding families according to structural criteria. We have analyzed how this sequence-based fold recognition procedure depends on the size of the amino acid alphabet from which the sequences were constructed. As discussed below, we observe that fold recognition is minimally degraded when the amino acid alphabet is reduced from 20 to 10 letters by appropriately grouping chemically similar amino acids, but the sequence-encoded information needed to differentiate folding families is rapidly degraded when further reductions in the alphabet are made.
The analogy between protein folding and protein fold recognition is, of course, incomplete. A sequence which has no detectable homology within a clustered database of proteins may well fold to a structure whose family is represented in the database; divergent evolution has produced many such examples. In this sense an estimate of the minimal alphabet size based on homology detection by sequence alignments represents an upper bound. Conversely, there is no guarantee that a synthetic sequence will actually fold even if its alignment scores against sequences in a particular family are very much larger than can be expected by chance; there could be kinetic barriers to the folding of such sequences, for example. Yet it is intriguing to suggest that if it is possible to construct representative sequences from a large number of different folding families which will fold using a reduced set of amino acids (a reduced alphabet), then the alignment scores of the corresponding sequence pairs will be very much larger than expected by chance. That is to say, if the different sequence patterns that encode a diversity of folding families can be preserved when the sequences are synthesized using reduced alphabets, then it should be possible to probe this relationship by carrying out computer alignment experiments on sequences constructed with reduced alphabets and comparing these results with those for the parent native sequences.
In this work we evaluate the extent to which sequence patterns derived from reduced alphabets preserve the information needed to detect homologs in a clustered database. The amino acid reduction scheme is based on the analysis of correlations among similarity matrix elements used for sequence alignments. We find that as the alphabet size is reduced, the information encoded in the amino acid sequences responsible for protein fold recognition is degraded. We estimate that for proteins of many different families a minimal alphabet requires 10-12 letters.
Materials and methods
The 20-letter amino acid alphabet is reduced to smaller alphabets based on correlations indicated by the BLOSUM50 similarity matrix, i.e. amino acid pairs with high similarity scores are grouped together. The procedure for grouping like amino acids together is as follows: first, the correlation coefficients between similarity matrix elements are calculated for all pairs of amino acids, i.e. for alanine (A) and valine (V) the coefficient would be evaluated as
[Equation image not available in source]
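The correlation step described above can be sketched as follows. This is a minimal illustration that assumes a Pearson correlation between the rows (similarity profiles) of two amino acids; the exact expression used for the coefficient is not reproduced in this copy, so the normalization here is an assumption. The toy matrix `M`, its values and the helper names are hypothetical and are not BLOSUM50 entries.

```python
import math

# Hypothetical toy similarity matrix over four residues.
# Values are illustrative only; they are NOT taken from BLOSUM50.
LETTERS = ["A", "V", "I", "K"]
M = {
    "A": {"A": 5, "V": 0, "I": -1, "K": -1},
    "V": {"A": 0, "V": 5, "I": 4, "K": -3},
    "I": {"A": -1, "V": 4, "I": 5, "K": -3},
    "K": {"A": -1, "V": -3, "I": -3, "K": 6},
}


def row_correlation(matrix, a, b, letters):
    """Pearson correlation between the similarity profiles (rows) of a and b.

    Assumption: mean-centred correlation over all matrix columns.
    """
    xs = [matrix[a][i] for i in letters]
    ys = [matrix[b][i] for i in letters]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def most_correlated_pair(matrix, letters):
    """Return the pair of distinct residues with the highest row correlation."""
    pairs = [(a, b) for i, a in enumerate(letters) for b in letters[i + 1:]]
    return max(pairs, key=lambda p: row_correlation(matrix, p[0], p[1], letters))


if __name__ == "__main__":
    print(most_correlated_pair(M, LETTERS))
```

With this toy matrix the pair selected for grouping is V and I, whose rows are nearly parallel; a full reduction would repeat the selection after each merge.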
Reduction of the similarity matrix based on the groupings is performed by calculating new matrix elements as the average of the appropriate old similarity matrix elements. For example, the score between a group consisting of (A) and one consisting of (ST) is computed as the average of the AS and AT terms. Thus whereas in the original similarity matrix an alignment of A with S contributes $M_{AS}$ to the overall alignment score, in the new similarity matrix the contribution is $M_{12} = (M_{AS} + M_{AT} + M_{SA} + M_{TA})/4$, and alignment of A with T is equivalent to that of A with S.
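The averaging rule can be illustrated with a short sketch. It assumes a symmetric similarity matrix stored as a nested dictionary and averages over every residue pair drawn from the two groups, which for a symmetric matrix gives the same value as the four-term expression above. Applying the same average within a group (the diagonal of the reduced matrix) is an assumption, and the scores below are hypothetical rather than BLOSUM50 values.

```python
from itertools import product

# Hypothetical symmetric scores for three residues (illustrative only).
TOY = {
    "A": {"A": 5, "S": 1, "T": 0},
    "S": {"A": 1, "S": 5, "T": 2},
    "T": {"A": 0, "S": 2, "T": 5},
}


def reduce_matrix(matrix, groups):
    """Build a reduced matrix keyed by (group, group).

    Each new element is the mean of the original scores over all residue
    pairs (a, b) with a in the first group and b in the second group.
    """
    reduced = {}
    for g1, g2 in product(groups, repeat=2):
        scores = [matrix[a][b] for a in g1 for b in g2]
        reduced[(g1, g2)] = sum(scores) / len(scores)
    return reduced


if __name__ == "__main__":
    reduced = reduce_matrix(TOY, ["A", "ST"])
    print(reduced[("A", "ST")])  # mean of the AS and AT terms
```

Groups are written as strings such as `"ST"`, so iterating over a group yields its member residues.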
Alphabet reductions derived from the similarity matrix are shown in Figure 1. The reduced alphabets studied in addition to those delineated in the figure are as follows: 3 letters, [(LASGVTIPMC), (EKRDNQH), (FYW)]; 5 letters, [(LVIMC), (ASGTP), (FYW), (EDNQ), (KRH)]; 6 letters, [(LVIM), (ASGT), (PHC), (FYW), (EDNQ), (KR)]; 12 letters, [(LVIM), (C), (A), (G), (ST), (P), (FY), (W), (EQ), (DN), (KR), (H)]; and 18 letters, [(LM), (VI), (C), (A), (G), (S), (T), (P), (F), (Y), (W), (E), (D), (N), (Q), (K), (R), (H)]. These groupings are similar to those previously proposed from examining amino acid side-chain properties (Miyata et al., 1979; Santibanez and Rohde, 1987) and other similarity matrices (Collins and Coulson, 1987; Risler et al., 1988; Landes and Risler, 1994).
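The groupings listed above can be applied to a sequence with a small helper. The sketch below encodes the five alphabets given in the text and rewrites a sequence in a reduced alphabet; using the first residue of each group as its representative letter is an arbitrary choice made for illustration, and the function names are hypothetical.

```python
# Reduced alphabets as listed in the text, keyed by number of letters.
REDUCED_ALPHABETS = {
    3: ["LASGVTIPMC", "EKRDNQH", "FYW"],
    5: ["LVIMC", "ASGTP", "FYW", "EDNQ", "KRH"],
    6: ["LVIM", "ASGT", "PHC", "FYW", "EDNQ", "KR"],
    12: ["LVIM", "C", "A", "G", "ST", "P", "FY", "W", "EQ", "DN", "KR", "H"],
    18: ["LM", "VI", "C", "A", "G", "S", "T", "P", "F", "Y", "W",
         "E", "D", "N", "Q", "K", "R", "H"],
}


def build_mapping(groups):
    """Map each residue to the first residue of its group (arbitrary representative)."""
    return {res: group[0] for group in groups for res in group}


def reduce_sequence(seq, groups):
    """Rewrite a one-letter-code sequence using one letter per group."""
    mapping = build_mapping(groups)
    return "".join(mapping[res] for res in seq)


if __name__ == "__main__":
    print(reduce_sequence("MKVLAT", REDUCED_ALPHABETS[5]))   # LKLLAA
    print(reduce_sequence("MKVLAT", REDUCED_ALPHABETS[12]))  # LKLLAS
```

Each alphabet partitions all 20 standard residues, so any sequence written in the standard one-letter code can be converted without loss of length.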
Results
In this work we formulate an amino acid reduction scheme based on the analysis of correlations among similarity matrix elements used for sequence alignments. Our procedure averages the matrix elements of the most closely related residues, and constructs reduced similarity matrices using these average values. Alphabet reductions derived from the underlying similarity matrix as explained in the Materials and methods section are shown in Figure 1.
This reduction of the amino acid alphabet followed several clearly recognizable paths: initially, residues with similar physical/chemical properties are grouped together, namely large hydrophobes (LVIM), amino acids with large and mainly hydrophobic aromatic side chains (FY[W]) and long-chain positively charged residues (KR). The 10-letter alphabet of Figure 1 contains five amino acid groups, including the three groups mentioned plus two additional hydrophilic groups, alcohols (ST) and charged/polar residues (EDNQ). Further reductions of the 20-letter code coalesce smaller residues, and ultimately the code reduces to two basic groups, hydrophobic and hydrophilic. The initial divisions of the reduced alphabets are similar to the five-letter code of Riddle et al. (1997), i.e. the I, K, E, A, G alphabet found for the β-barrel-like protein. Down to and including the level of reduction corresponding to the 10-letter alphabet in Figure 1, these amino acids are maintained in separate groups. Thus the specific example of the reduced alphabet that Riddle et al. determined for SH3 is consistent with the proposed simplification scheme through 10 letters, but not below 10 because of AG pairing at the eight-letter level. Figure 1 serves as a guide to the construction of reduced alphabets which may be useful for correlating sequence patterns and folding patterns in a statistical sense, as observed across a large number of folding families, rather than as a recipe for constructing a particular protein using a very small alphabet.
Having constructed similarity matrices corresponding to reduced alphabets, we proceed to evaluate the extent to which sequence patterns derived from the reduced alphabets preserve the patterns needed to detect homologs in a clustered database. For this analysis we use the SCOP40 clustered database (Murzin et al., 1995; Brenner et al., 1998), containing 1323 proteins assigned to 639 folding families; no two homologous sequences share more than 40% sequence identity in this database. Coverage versus errors per query (EPQ) plots (Brenner et al., 1998) used to assess sequence-based methods for homology detection provide the context for our analysis of the effect of reduced alphabets on fold recognition. Figure 2 (inset) shows the coverage versus EPQ plot for the SCOP40 database constructed using the complete 20-letter amino acid code. The coverage is defined as the fraction of homologous sequence pairs that have alignment scores above a threshold, while the EPQ is defined for the same threshold as the total number of non-homologous proteins with alignment scores above the threshold divided by the total number of queries made. As shown in Figure 2 (inset), there is, for example, a 20% coverage of SCOP40 at an EPQ of 0.1 when making use of the full information content of the 20-letter code. This means that we can detect approximately 20% of the true homologs in the SCOP40 database by sequence comparison with a 10% error rate (i.e. the alignment score threshold is set to a value such that 90% of the aligned pairs with scores greater than this threshold are homologous).
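The two quantities defined above can be computed directly from a list of scored pairs. The following is a minimal sketch of those definitions; the function name, the input layout (a list of `(score, is_homolog)` tuples) and the toy scores are hypothetical and do not come from SCOP40.

```python
def coverage_and_epq(scored_pairs, threshold, n_homologous_pairs, n_queries):
    """Return (coverage, errors per query) at a given score threshold.

    scored_pairs: list of (alignment_score, is_homolog) tuples.
    coverage: homologous pairs scoring above the threshold, divided by
              the total number of homologous pairs.
    EPQ: non-homologous pairs scoring above the threshold, divided by
         the total number of queries.
    """
    true_hits = sum(1 for s, h in scored_pairs if s > threshold and h)
    false_hits = sum(1 for s, h in scored_pairs if s > threshold and not h)
    return true_hits / n_homologous_pairs, false_hits / n_queries


if __name__ == "__main__":
    # Hypothetical toy data: 4 homologous pairs in total, 10 queries.
    pairs = [(90, True), (80, True), (70, False),
             (60, True), (50, False), (40, True)]
    print(coverage_and_epq(pairs, 65, n_homologous_pairs=4, n_queries=10))
```

Sweeping the threshold from high to low traces out a coverage versus EPQ curve of the kind described for Figure 2 (inset).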
The effects of alphabet reduction on protein fold recognition were tested in the following way. The similarity matrices for 10 sets of increasingly reduced alphabets obtained by grouping the amino acids as shown in Figure 1 were assembled. For each of these reduced alphabets, all-against-all sequence alignments were performed using the SCOP40 database. Coverage versus EPQ plots were evaluated for each of the datasets. As the alphabet size is reduced, the information encoded in the amino acid sequences that is responsible for the protein fold recognition is lost. One way to characterize this is to compare the coverages of SCOP40 at a chosen error rate using the different reduced alphabets. The fractional coverage retained relative to the 20-letter alphabet at an EPQ value of 0.001 is shown in Figure 2. There is a strong non-linear dependence of the fold recognition (the coverage) on the number of amino acids in the alphabet from which the similarity matrices (and thus the sequences) were constructed. As the alphabet is reduced from 20 letters to 12 or 10, the percentage coverage retained is reduced by only ~10%; further reduction of the alphabet is accompanied by a steep loss of fold recognition. With a four-letter alphabet, the coverage of the SCOP40 database at an EPQ of 0.001 is reduced by 90% relative to the complete 20-letter code. When the alphabet is reduced to two types of residues, hydrophilic and hydrophobic, there is no detectable fold recognition.
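The comparison at a fixed error rate can be sketched as follows: rank all pairs by score, find the largest coverage reached before the EPQ exceeds the chosen value, and divide the reduced-alphabet coverage by the 20-letter coverage. This is an illustrative sketch, not the procedure used to produce Figure 2; the function name and toy scores are hypothetical, and tied scores are not treated specially.

```python
def coverage_at_epq(scored_pairs, max_epq, n_homologous_pairs, n_queries):
    """Largest coverage reachable while errors per query stay <= max_epq.

    scored_pairs: list of (alignment_score, is_homolog) tuples.
    Pairs are accepted in order of decreasing score.
    """
    best = 0.0
    true_hits = 0
    false_hits = 0
    for _, is_homolog in sorted(scored_pairs, key=lambda p: -p[0]):
        if is_homolog:
            true_hits += 1
        else:
            false_hits += 1
        if false_hits / n_queries > max_epq:
            break
        best = true_hits / n_homologous_pairs
    return best


if __name__ == "__main__":
    # Hypothetical scores for the same pairs under two alphabets.
    full = [(90, True), (80, True), (70, False),
            (60, True), (50, False), (40, True)]
    reduced = [(90, True), (85, False), (70, True),
               (65, False), (60, True), (30, True)]
    cov_full = coverage_at_epq(full, 0.1, 4, 10)
    cov_reduced = coverage_at_epq(reduced, 0.1, 4, 10)
    print(cov_reduced / cov_full)  # fraction of coverage retained
```

In this toy case the reduced alphabet retains two thirds of the coverage obtained with the full alphabet at the same EPQ.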
The results shown in Figure 2 correspond to the analysis of fold recognition using the entire SCOP40 database (Murzin et al., 1995; Brenner et al., 1998). We have also examined the effects of alphabet reduction on fold recognition using subsets of this database corresponding to five major fold categories of the SCOP classification scheme (Murzin et al., 1995). We did not detect a strong dependence of the results on fold type; hence this analysis does not support the suggestion that β-sheet containing proteins are less tolerant to an amino acid reduction than α-helical proteins (Riddle et al., 1997). However, effects due to the differences in size and distribution of sequences within families among the different folding classes could obscure differences in the dependence of the coverage-EPQ plots on the alphabet.
Discussion
Theoretical considerations concerning the folding of heteropolymers indicate that a certain minimum complexity in the polymeric building blocks is required for folding on both kinetic and thermodynamic grounds (Bryngelson et al., 1995; Wolynes et al., 1995; Hinds and Levitt, 1996; Klimov and Thirumalai, 1996; Shakhnovich, 1996; Dill and Chan, 1997; Wolynes, 1997). From the results of simplified lattice model simulations, it has been postulated that a minimum of three different amino acid types is required for protein folding. However, the true minimal alphabet may well require additional complexity for the creation of the large number of protein fold types of the kind actually observed in nature. Although the present studies do not address the physics of the problem in the way that lattice simulations are designed to do, this analysis of simplified amino acid alphabets required for protein fold recognition does have implications for the protein folding problem, particularly with regard to the relationship between reduced alphabets and the diversity of fold types. In the context of protein fold recognition by sequence alignment, we find that sequences constructed from 10-letter alphabets obtained by grouping amino acids appropriately contain nearly as much information as the natural sequences do.
References
Brenner,S.E., Chothia,C. and Hubbard,J.P. (1998) Proc. Natl Acad. Sci. USA, 95, 6073-6078.
Bryngelson,J.D., Onuchic,J.N., Socci,N.D. and Wolynes,P.G. (1995) Proteins: Struct. Funct. Genet., 21, 167-195.
Collins,J.F. and Coulson,A.F.W. (1987) In Bishop,M.J. and Rawlings,C.J. (eds), Nucleic Acid and Protein Sequence Analysis; a Practical Approach, Vol. 3. IRL Press, Washington, D.C., pp. 323-358.
Davidson,A.R., Lumb,K.J. and Sauer,R.T. (1995) Nature Struct. Biol., 2, 856-864.
Dill,K.A. and Chan,H.S. (1997) Nature Struct. Biol., 4, 10-19.
Feng,S., Chen,J.K., Yu,H., Simon,J.A. and Schreiber,S.L. (1994) Science, 266, 1241-1247.
Heinz,D.W., Baase,W.A. and Matthews,B.W. (1992) Proc. Natl Acad. Sci. USA, 89, 3751-3755.
Henikoff,S. and Henikoff,J.G. (1992) Proc. Natl Acad. Sci. USA, 89, 10915-10919.
Hinds,D.A. and Levitt,M. (1996) J. Mol. Biol., 258, 201-209.
Kamtekar,S., Schiffer,J.M., Babik,J.M. and Hecht,M.H. (1993) Science, 262, 1680-1685.
Klimov,D.K. and Thirumalai,D. (1996) Proteins: Struct. Funct. Genet., 26, 411-441.
Landes,C. and Risler,J. (1994) CABIOS, 10, 453-454.
Levinthal,C., Wodak,S.J., Kahn,P. and Dadivanian,A.K. (1975) Proc. Natl Acad. Sci. USA, 72, 1330-1334.
Matthews,C.R. (1993) Annu. Rev. Biochem., 62, 139-160.
Miyata,T., Miyazawa,S. and Yasunaga,T. (1979) J. Mol. Evol., 12, 219-236.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536-540.
Myers,E.W. and Miller,W. (1988) CABIOS, 4, 11-17.
Pearson,W.R. (1990) Methods Enzymol., 183, 63-98.
Plaxco,K.W., Riddle,D.S., Grantcharova,V. and Baker,D. (1998) Curr. Opin. Struct. Biol., 8, 80-85.
Regan,L. and DeGrado,W.F. (1988) Science, 241, 976-978.
Riddle,D.S., Santiago,J.V., Bray-Hall,S.T., Doshi,N., Grantcharova,V.P., Yi,Q. and Baker,D. (1997) Nature Struct. Biol., 4, 805-809.
Risler,J.L., Delorme,M.O., Delacroix,H. and Henaut,A. (1988) J. Mol. Biol., 204, 1019-1029.
Sander,C. and Schulz,G.E. (1979) J. Mol. Evol., 13, 245-252.
Santibanez,M. and Rohde,K. (1987) CABIOS, 3, 111-114.
Shakhnovich,E.I. (1996) Folding Des., 1, R50-R54.
Wolynes,P.G. (1997) Nature Struct. Biol., 4, 871-874.
Wolynes,P.G., Onuchic,J.N. and Thirumalai,D. (1995) Science, 267, 1619-1620.
Received June 1, 1999; revised November 11, 1999; accepted November 29, 1999.