Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow 117984, Russia
Keywords: amino acid composition/nucleotide composition/protein structure/protein termini/statistical analysis
Introduction
The carboxyl (C) termini of proteins possess some distinct properties: (i) at the ribosomal P-site, they neighbor the tRNA moiety in polypeptidyl-tRNA at the terminal step of protein synthesis and, by interacting with it, they may modulate the efficiency of translation termination; further, the C-terminal amino acid residues may also interact with class I polypeptide chain release factors (RF1, RF2 or eRF1), which occupy the ribosomal A-site at translation termination (Mottagui-Tabar et al., 1994; Björnsson et al., 1996; Nakamura et al., 1996; Tate and Mannering, 1996; Buckingham et al., 1997; Drugeon et al., 1997); (ii) if a given protein folds co-translationally, the nascent C-terminus may interact with the already formed part of the protein globule (Fedorov et al., 1992; Fedorov and Baldwin, 1995; Hardesty et al., 1995; Kolb et al., 1995; Brunak and Engelbrecht, 1996); (iii) in general, one may assume that the C-end serves as a lock stabilizing the spatial protein structure, as has already been shown for collagen, for example (Prockop and Kivirikko, 1995).
The non-random occurrence of certain codons in the last (-1) sense codon position and of certain amino acid residues at the C-terminus is well known (Buckingham et al., 1990; Brown et al., 1990a,b, 1993; Kopelowitz et al., 1992; Arkov et al., 1993, 1995; Alff-Steinberger and Epstein, 1994). The modulation of translation termination efficiency by the two C-terminal amino acids of the nascent polypeptide chain [positions (-1) and (-2)] has recently been revealed (Mottagui-Tabar et al., 1994; Björnsson et al., 1996). These data raise an important issue: how far into the polypeptide chain does this C-terminal bias extend? If only the last two amino acids affect the termination process, only the (-1) and (-2) positions might be biased, leaving the amino acid composition of other C-terminal positions random. However, if the bias extends beyond the last two positions, it may indicate that the amino acid composition of the C-terminal fragments is governed by factors other than the requirements of the translation termination machinery.
The above considerations prompted us to examine the extended N- and C-terminal regions of proteins and the 3' ends of the coding DNA sequences preceding the stop codons by means of an exhaustive statistical analysis. The compulsory prerequisite for application of such an approach was a preliminary cleansing of the protein and nucleic acid databases and a sufficient total number of available sequences to ensure a high significance level.
Materials and methods
Continuous protein sequences and C-terminal peptides longer than 50 residues were extracted from the SWISS-PROT database according to the Sequence specification in the feature table. In total, 123 Escherichia coli C-terminal protein sequences and peptides (30 966 amino acid residues) were analyzed. A set of 516 mammalian proteins (129 745 amino acid residues) was also considered. E.coli and Homo sapiens coding sequences longer than 150 base pairs were taken from the EMBL database according to the Coding Sequence (CDS) specification in the feature table. A CDS was included in the set only if it was either a complete coding sequence or had a long 5' context preceding the stop codon. In addition, preservation of open reading frames was monitored, so that all sequences under investigation included an integral number of codons. All duplicate sequences were excluded from further consideration.
Because the main goal of our work was to analyze local peculiarities of the terminal regions and the reproducibility of these peculiarities in different sequences, sequences with high or moderate similarity were also excluded. This cleaning procedure was performed in three steps. First, the sequences with more than 60% identity of the 3'-terminal 50 codons were rejected. Second, the sequences with more than 50% identity in the last 30 codons were rejected. Finally, for the last 10 codons the sequences with more than 60% identity were also rejected. This procedure decreases similarity at the terminal regions of the sequences. After database editing, the sets of unique coding sequences for E.coli and H.sapiens comprised 2003 sequences (668 374 codons) and 3640 sequences (1 772 364 codons), respectively.
These two sets were further subdivided into three subsets for UAA, UGA and UAG stop codons; for E.coli, 1247 sequences (420 381 codons), 601 sequences (197 237 codons) and 155 sequences (50 753 codons), respectively; for humans, 1132 sequences (545 276 codons), 1729 sequences (827 012 codons) and 779 sequences (400 076 codons), respectively.
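The three-step similarity filter described above can be sketched in Python as follows. This is only an illustration, not the original Borland C program: the thresholds come from the text, but the greedy one-pass scheme, the gap-free alignment of codons at the 3' end, and all function names are assumptions of this sketch.

```python
from typing import List

# (window length in codons, maximum tolerated fraction of identical codons).
# The three thresholds are those quoted in the text.
FILTER_STEPS = [(50, 0.60), (30, 0.50), (10, 0.60)]


def split_codons(cds: str) -> List[str]:
    """Split a coding sequence (stop codon removed) into codons."""
    if len(cds) % 3 != 0:
        raise ValueError("sequence does not contain an integral number of codons")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def terminal_identity(a: List[str], b: List[str], window: int) -> float:
    """Fraction of identical codons among the last `window` codons.

    Assumption: codons are compared position by position, counting back
    from the 3' end, without gaps.
    """
    n = min(window, len(a), len(b))
    if n == 0:
        return 0.0
    same = sum(1 for x, y in zip(a[-n:], b[-n:]) if x == y)
    return same / n


def too_similar(a: List[str], b: List[str]) -> bool:
    """True if the pair exceeds any of the three identity thresholds."""
    return any(terminal_identity(a, b, w) > t for w, t in FILTER_STEPS)


def filter_redundant(seqs: List[str]) -> List[str]:
    """Greedy filter (assumed): keep a sequence only if it is not too
    similar to any sequence that has already been kept."""
    kept: List[str] = []
    kept_codons: List[List[str]] = []
    for s in seqs:
        codons = split_codons(s)
        if not any(too_similar(codons, k) for k in kept_codons):
            kept.append(s)
            kept_codons.append(codons)
    return kept
```

In this sketch the outcome depends on the input order, because the first member of a similar pair is the one retained; the text does not say which member was kept in the original procedure.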
Protein spatial structure database
Proteins with known 3-D structure were taken from the Protein Data Bank (PDB). In total, 486 unique amino acid sequences were considered, composed of 116 298 amino acid residues. In 41 cases, where the N-terminal amino acids were not resolved by X-ray analysis and were therefore absent from the PDB, they were taken from Entrez to complete the protein sequence at the N-terminus.
Statistical analysis
The expected (Exp) frequencies of amino acid residues were calculated from the average residue usage in the above-mentioned sets of amino acid sequences. The expected frequency for an amino acid residue of type A at position i is Exp = ($N_A$/N)M, where $N_A$ is the total number of amino acid residues of type A in the analyzed set of sequences, excluding position i; N is the total number of all amino acid residues in the analyzed set of sequences, excluding position i; and M is the total number of sequences, i.e. the number of ith positions in the analyzed set of sequences. The expected frequencies for codons were calculated similarly.
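As a minimal sketch of this calculation, the Python functions below count observed residues at one position and derive the expected counts from the background composition with that position excluded. The function names and the use of negative indices for positions counted from the C-terminus are assumptions made for illustration; sequences may be strings of residues or lists of codons.

```python
from collections import Counter
from typing import Dict, Sequence


def observed_counts(seqs: Sequence[Sequence[str]], pos: int) -> Dict[str, int]:
    """Obs: how often each residue (or codon) occurs at position `pos`."""
    return dict(Counter(s[pos] for s in seqs))


def expected_counts(seqs: Sequence[Sequence[str]], pos: int) -> Dict[str, float]:
    """Exp = (N_A / N) * M for every residue type A.

    N_A and N are counted over all positions except `pos`;
    M is the number of sequences.
    """
    m_total = len(seqs)
    background: Counter = Counter()
    for s in seqs:
        skip = pos % len(s)  # turn a negative index into a positive one
        for j, residue in enumerate(s):
            if j != skip:
                background[residue] += 1
    n_total = sum(background.values())
    return {r: count / n_total * m_total for r, count in background.items()}
```

For example, `expected_counts(["AAK", "AGK", "GGA"], -1)` gives 1.5 for both A and G, because the background (with the last position removed) contains three of each and there are three sequences.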
For each amino acid residue (codon) at a given position, the deviation of the observed (Obs) values from the Exp values was estimated by the χ² criterion according to the formula (Obs − Exp)²/Exp. For each residue or codon, the χ² value was estimated separately with one degree of freedom. The sum of all 20 (61) χ² values for the residues (codons) at the given position gave the total deviation for that position with 19 (60) degrees of freedom. To evaluate the range of differences between the C-terminal regions and the neighboring fragments, a pairwise comparison between them was performed. For this purpose, each position in the sequence was treated as a set containing 20 groups of data and the difference between them was calculated by the χ² criterion using the following formula (Borovkov, 1984):
$$\chi^2 = MN \sum_{i=1}^{K} \frac{1}{m_i + n_i}\left(\frac{m_i}{M} - \frac{n_i}{N}\right)^2$$
where $m_i$ and $n_i$ are frequencies of amino acid residues in the two positions of the sequence under comparison, M and N are total numbers of amino acid residues in the compared positions and K is equal to 20 because each position may be occupied by any of 20 different amino acids. At the significance level <0.001, Obs was considered to be different from Exp if the χ² exceeded 10.8, 43.8 and 99.6 for one, 19 and 60 degrees of freedom, respectively. χ² was not calculated for Exp ≤ 2.
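The statistics described in this section could be computed along the following lines. This is a hedged sketch rather than the original program. The pairwise comparison is written in the standard two-sample homogeneity form, χ² = MN Σ (m_i/M − n_i/N)²/(m_i + n_i), which is assumed here because it matches the symbols defined in the text. Skipping cells with Exp ≤ 2 is likewise an assumption about the exact cut-off. All names are illustrative.

```python
from typing import Dict

# Critical chi-square values at the 0.001 level, as quoted in the text,
# keyed by degrees of freedom.
CRITICAL_0001 = {1: 10.8, 19: 43.8, 60: 99.6}


def chi2_cell(obs: float, exp: float) -> float:
    """Deviation of one residue (codon) at one position: (Obs - Exp)^2 / Exp."""
    return (obs - exp) ** 2 / exp


def chi2_position(observed: Dict[str, float],
                  expected: Dict[str, float],
                  min_exp: float = 2.0) -> float:
    """Total deviation for one position: sum of per-residue chi-square values.

    Cells with Exp <= min_exp are skipped (assumed cut-off).
    """
    total = 0.0
    for key, exp in expected.items():
        if exp <= min_exp:
            continue
        total += chi2_cell(observed.get(key, 0), exp)
    return total


def chi2_pairwise(m: Dict[str, float], n: Dict[str, float]) -> float:
    """Two-sample homogeneity statistic for the residue counts at two positions.

    m and n map residue type to its count in each position (assumed form).
    """
    m_total = sum(m.values())
    n_total = sum(n.values())
    acc = 0.0
    for key in set(m) | set(n):
        mi = m.get(key, 0)
        ni = n.get(key, 0)
        if mi + ni == 0:
            continue
        acc += (mi / m_total - ni / n_total) ** 2 / (mi + ni)
    return m_total * n_total * acc
```

A position would then be called biased when `chi2_position(...)` exceeds `CRITICAL_0001[19]` for amino acids or `CRITICAL_0001[60]` for codons, and two positions would be called different when `chi2_pairwise(...)` exceeds `CRITICAL_0001[19]`.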
The programs used in the database editing and the subsequent statistical analysis were written in Borland C and run on an IBM/PC Pentium-100 computer.
Results
For many proteins there is no complete coincidence between the coding sequence at genomic or mRNA levels and the amino acid sequence of the mature proteins. This holds true for the polypeptide termini where post-translational processing may significantly alter the N ends, as mentioned in the Introduction. The cellular carboxypeptidases may affect the composition of the C end in mature proteins. For these reasons, we first analyzed the N and C termini of mature proteins with known 3-D structure taken from the PDB (Table I).
It should be noted that the protein set deposited in the PDB is certainly non-random. It is composed of the most abundant and stable proteins whereas proteins with short half-lives (e.g. heat shock proteins) or rare proteins (e.g. transcription factors) are not represented in the PDB. For this reason, the bias documented in Table I is typical for a certain set of proteins, namely for the most abundant and stable molecules.
The C-ends in the protein set with known 3-D structures are also biased (Table I): Lys and Cys are over-represented whereas Thr is under-represented.
From the data summarized in Table I, we conclude that both termini of the polypeptide chains of the most abundant and stable proteins in various groups of organisms are non-random. The main limitation of this conclusion derives from the analyzed set of proteins, which is not sufficient for exhaustive statistical analysis and at the same time the set is not representative because only crystallizable proteins were analyzed. Further, the PDB set is composed of proteins of both prokaryotes and eukaryotes and if these groups possess different biases at both termini they are not recognized if only PDB sequences are considered.
For all these reasons, we continued to analyze the C-terminal bias by taking into account many more protein and nucleic acid sequences from databanks.
Statistical bias for the E.coli and mammalian C-terminal sequences
We found a bias in the C-terminal amino acid frequencies for both E.coli and mammalian polypeptide chains. Even for a relatively limited set of protein sequences from E.coli, it was possible to detect over-representation of Lys in the (-1) position (χ² = 28.1; Obs and Exp values for this position are 22 and 7.5, respectively), confirming at the protein level the over-representation of Lys codons in E.coli demonstrated earlier (Brown et al., 1990a,b, 1993; Kopelowitz et al., 1992; Arkov et al., 1993; Alff-Steinberger and Epstein, 1994). Position (-2), Obs = 15, Exp = 7.5, and position (-3), Obs = 16, Exp = 7.5, also prefer Lys residues over the expected values calculated from the average frequencies of amino acid residues in the set of 123 protein sequences containing 30 966 residues. For mammalian proteins (Table II), prominent peculiarities were found in the C-terminal sequences: Lys and Cys residues were over-represented, in agreement with the Lys codon over-representation in H.sapiens (Arkov et al., 1995). However, more refined analysis was hindered by the insufficient total number of well characterized protein sequences: for statistical analysis, it is critically important to deal with large sets of sequences to obtain reliable results. Therefore, we had to extend our analysis to the coding regions of genes, because in this case the sets available after appropriate cleansing (see Materials and methods) were much larger and allowed high-fidelity analysis.
The frequencies for the over- and under-represented 3'-terminal codons for the E.coli and human sequences are presented in Table III. The χ² values demonstrated the bias from the average frequencies estimated by the analysis of the codon context database (see Materials and methods). The codon distribution (2003 sequences from E.coli) differed significantly (P < 0.001) in the 3'-terminal coding region, mainly from the (-1) to (-8) positions, counting from the stop codon. The over- or under-representation of one or more codons for charged amino acid residues was observed from the (-1) to (-8) positions. The preferred codons for the C-terminal octamer were those for Lys and Arg. Strongly under-represented were codons for Thr (two out of four), Met (AUG) and Gly (GGC and GGU). For H.sapiens (3640 sequences), the codon bias covered the C-terminal nonamer, versus the octamer for E.coli. The majority of the different codon types was biased at the (-1) position, where seven amino acids were over-represented, as opposed to the five residues noted earlier (Arkov et al., 1995). In the (-4) position the Cys residue was over-represented.
Discussion
The general similarity between the amino acid composition of the protein C ends calculated from the PDB database of spatial structures and the amino acid composition of proteins deduced from their coding nucleotide sequences provides indirect evidence that carboxypeptidases are not involved in processing of a considerable part of C-termini. In other words, we assume that the majority of protein C ends in the cell are not truncated in the course of maturation and/or proteolysis.
Most proteins have a quaternary organization ensured by intersubunit interactions. The positively charged C end of one subunit may interact with the negatively charged internal amino acid cluster belonging to the other subunit of the same protein molecule. Since most of the C ends have a surface location, one may anticipate that these fragments could serve as substrates for post-translational modification(s) (acetylation, methylation, etc.). Moreover, positively charged C-terminal clusters on the globular surface may govern the binding to other macromolecules charged negatively, e.g. nucleic acids.
The observed bias at the N and C ends may be considered in the frame of the general concepts of protein globule formation and stabilization (Cantor and Schimmel, 1980; Creighton, 1993). Indeed, protein folding is often initiated at the N end of the polypeptide chain (Fedorov et al., 1992; Fedorov and Baldwin, 1995; Hardesty et al., 1995; Kolb et al., 1995). The C-terminal octa/nonamers with the bias in their amino acid composition shown in this work should better fix their positions at the globular surface via non-covalent and covalent interactions. Ion pairs formed at the surface contribute to the stability of protein 3-D structure while salt bridges inside the globule tend to destabilize it (Barlow and Thornton, 1983; Dill, 1990; Horovitz et al., 1990; Serrano et al., 1990; Sali et al., 1991; Hendsch and Tidor, 1994; Starich et al., 1996). In addition to salt bridges, H-bonds may be involved in positioning of the C termini on the globular surface. Moreover, the covalent cross-links between Cys residues (S–S bridges) could also be involved in stabilization of the protein surface structure.
The C-terminal octa/nonamer bias is considered to have little effect on translation termination. It is known that the (-3) amino acid position has a very weak influence on translation termination in E.coli at the UGA stop codon, whereas the C-terminal dipeptide affects the termination efficiency (Mottagui-Tabar et al., 1994; Björnsson et al., 1996), but this effect is not related to the strong bias at the last two positions. For example, as shown in Table IV, the (-2) position in E.coli proteins has no bias at all towards three termination codons, whereas basic amino acids ensure efficient termination at UGA versus acidic residues that are inefficient (Björnsson et al., 1996). For the (-1) position, many amino acid residues are favorable for efficient termination in E.coli at UGA (Björnsson et al., 1996), although in this position only polar (charged) amino acids are over-represented (Table IV).
In the present work, neither protein nor nucleic acid sets were subdivided depending on the protein abundance, size, subunit or domain composition, isoelectric point, globular or filamentous shape, loop regions, etc. At the moment, the total number of available sequences seems to be insufficient for such a refined statistical analysis of protein families. However, in the near future, when many more sequences become available, such analysis may reveal sharper biases for certain groups of proteins. On the other hand, it may turn out that some protein groups are random at their C-termini (for instance, filamentous or membrane proteins, where the role of C ends is probably less critical for the maintenance of the overall protein structure).
In conclusion, we explain the bias at the protein termini shown in this work by: (i) a putative role of N- and C-termini in protein spatial structure (Thornton and Sibanda, 1983; Christopher and Baldwin, 1996); (ii) involvement in maintenance of the quaternary protein structure; (iii) specific targeting to certain cell compartments and/or serving as a substrate for certain types of post-translational modification(s) and/or degradation; (iv) modulation of translation initiation and termination by the marginal amino acids (Varshavsky, 1992, 1996; Mottagui-Tabar et al., 1994; Björnsson et al., 1996; Nakamura et al., 1996; Tate and Mannering, 1996). It is evident that the bias of protein ends is one of the constraints which one should take into account in the design of new protein molecules or in engineering of the existing proteins.
References
Arkov,A.L., Korolev,S.V. and Kisselev,L.L. (1993) Nucleic Acids Res., 21, 2891–2897.
Arkov,A.L., Korolev,S.V. and Kisselev,L.L. (1995) Nucleic Acids Res., 23, 4712–4716.
Bachmair,A., Finley,D. and Varshavsky,A. (1986) Science, 234, 179–186.
Barlow,D.J. and Thornton,J.M. (1983) J. Mol. Biol., 168, 867–885.
Berezovsky,I.N., Kilosanidze,G.T., Tumanyan,V.G. and Kisselev,L.L. (1996) Folding Des., 1, Supplement, 9–10.
Berezovsky,I.N., Kilosanidze,G.T., Tumanyan,V.G. and Kisselev,L. (1997) FEBS Lett., 404, 140–142.
Björnsson,A., Mottagui-Tabar,S. and Isaksson,L.A. (1996) EMBO J., 15, 101–109.
Borovkov,A.A. (1984) Mathematical Statistics. Additional Chapters. Nauka, Moscow, pp. 13–15.
Brown,C.M., Stockwell,P.A., Trotman,C.N.A. and Tate,W.P. (1990a) Nucleic Acids Res., 18, 2079–2086.
Brown,C.M., Stockwell,P.A., Trotman,C.N.A. and Tate,W.P. (1990b) Nucleic Acids Res., 18, 6339–6345.
Brown,C.M., Dalphin,M.E., Stockwell,P.A. and Tate,W.P. (1993) Nucleic Acids Res., 21, 3119–3123.
Brunak,S. and Engelbrecht,J. (1996) Proteins, 25, 237–252.
Buckingham,R.H., Sörensen,P., Pagel,F.T., Hijazi,K.A., Mims,B.H., Brechemier-Baey,D. and Murgola,E.J. (1990) Biochim. Biophys. Acta, 1050, 259–260.
Buckingham,R.H., Grentzmann,G. and Kisselev,L. (1997) Mol. Microbiol., 24, 449–456.
Cantor,C.R. and Schimmel,P.R. (1980) Biophysical Chemistry. Freeman, San Francisco.
Christopher,J.A. and Baldwin,T.O. (1996) J. Mol. Biol., 257, 175–187.
Creighton,T.E. (1993) Protein Structures. Freeman, San Francisco.
Dill,K.A. (1990) Biochemistry, 29, 7133–7155.
Drugeon,G., Jean-Jean,O., Frolova,L., Le Goff,X., Philippe,M., Kisselev,L. and Haenni,A.-L. (1997) Nucleic Acids Res., 25, 2254–2258.
Fedorov,A.N. and Baldwin,T.O. (1995) Proc. Natl Acad. Sci. USA, 92, 1227–1231.
Fedorov,A.N., Friguet,B., Djavadi-Ohaniance,L., Alakhov,Y.B. and Goldberg,M.E. (1992) J. Mol. Biol., 228, 351–358.
Grigoryev,S., Stewart,A.E., Kwon,Y.T., Arfin,S.M., Bradshaw,R.A., Jenkins,N.A., Copeland,N.G. and Varshavsky,A. (1996) J. Biol. Chem., 271, 28521–28532.
Hardesty,B., Kudlicki,W., Odom,O., Zhang,T., McCarthy,D. and Kramer,G. (1995) Biochem. Cell Biol., 73, 1199–1207.
Hendsch,Z.S. and Tidor,B. (1994) Protein Sci., 3, 211–226.
Horovitz,A., Serrano,L., Avron,B., Bycroft,M. and Fersht,A.R. (1990) J. Mol. Biol., 216, 1031–1044.
Kolb,V.A., Makeyev,V., Kommer,A. and Spirin,A. (1995) Biochem. Cell Biol., 73, 1217–1220.
Kopelowitz,J., Hampe,C., Goldman,R., Reches,M. and Engelberg-Kulka,H. (1992) J. Mol. Biol., 225, 261–269.
Mottagui-Tabar,S., Björnsson,A. and Isaksson,L.A. (1994) EMBO J., 13, 249–257.
Nakamura,Y., Ito,K. and Isaksson,L.A. (1996) Cell, 87, 147–150.
Prockop,D.J. and Kivirikko,K.I. (1995) Annu. Rev. Biochem., 64, 403–434.
Sali,D., Bycroft,M. and Fersht,A.R. (1991) J. Mol. Biol., 220, 779–788.
Serrano,L., Horovitz,A., Avron,B., Bycroft,M. and Fersht,A.R. (1990) Biochemistry, 29, 9343–9352.
Sherman,F., Stewart,J.W. and Tsunasawa,S. (1985) BioEssays, 3, 27–31.
Starich,M.R., Sandman,K., Reeve,J.N. and Summers,M.F. (1996) J. Mol. Biol., 255, 187–203.
Stewart,A.E., Arfin,S.M. and Bradshaw,R.A. (1995) J. Biol. Chem., 270, 25–28.
Tate,W.P. and Mannering,S.A. (1996) Mol. Microbiol., 21, 213–219.
Thornton,J.M. and Sibanda,B.L. (1983) J. Mol. Biol., 167, 443–460.
Trifonov,E.N. (1987) J. Mol. Biol., 194, 643–652.
Varshavsky,A. (1992) Cell, 69, 725–735.
Varshavsky,A. (1996) Proc. Natl Acad. Sci. USA, 93, 12142–12149.
Yoshida,A. and Lin,M. (1972) J. Biol. Chem., 247, 952–957.
Received February 20, 1998; revised September 10, 1998; accepted September 14, 1998.