Institute for Cancer Research, Fox Chase Cancer Center, 7701 Burholme Avenue, Philadelphia, PA 19111, USA
Abstract |
---|
Keywords: fusion protein/linker sequence/loop library/program/proteins
Introduction |
---|
The construction of a fusion protein involves the linking of two macromolecules by a linker sequence. The macromolecules involved usually include proteins and globular domains of proteins. The selection of the linker sequence is particularly important in the construction of functional fusion proteins. In addition to an appropriate amino acid composition, the overall folding of the linker must be taken into consideration. Robinson and Sauer (1998) found that the linker sequence composition could have a significant effect on the folding stability of a fusion protein. It is also unfavorable to have a linker sequence with a high propensity for forming α-helical or β-strand structures, because these would limit the flexibility of the fusion protein and consequently affect its functional activity. Therefore, the design of a linker sequence often requires careful consideration in order to avoid such secondary structural elements. Unfortunately, there are no reliable selection criteria available for use in linker design, and most current linker selection is still largely dependent on intuition. Although significant progress has been made in predicting secondary structures of proteins based on primary sequences (Barton, 1995; Jones, 1997), our understanding of sequence–structure correlation is still limited. On average, current algorithms achieve a prediction accuracy of about 72%, with higher reliability in helix predictions and a lower level of confidence in β-strand and loop region predictions (Barton, 1995). Selection by intuition therefore leaves great uncertainty, particularly for longer linker sequences.
We have developed a computer program, LINKER, that automatically generates a set of linker sequences according to the input parameters. This program is based on the assumption that the observed loop sequences in the X-ray crystal structures or the NMR solution structures are likely to adopt an extended conformation as linkers in the fusion protein. The program searches a loop library derived from the Brookhaven Protein DataBank (PDB) (Bernstein et al., 1977). The basic input to the program is the desired linker sequence length. The output of the program is a set of amino acid sequences of specified length. We have also incorporated some optional input parameters specifically designed to help users to select linker sequences that would fit in with their particular research needs.
Materials and methods |
---|
Our loop library is derived from the Brookhaven Protein DataBank (PDB) (September 1999 release). From 9600 PDB files, 27 554 loop sequences of various lengths were extracted. In order to ensure that all loops in the loop library are structurally observed, loop sequences were first extracted according to the authors' records in the PDB files. Each sequence was then compared with a second loop library generated by the DSSP program, which automatically identifies secondary and loop structures of proteins based on predefined backbone dihedral angles, i.e. the φ, ψ values, and hydrogen bonding patterns (Kabsch and Sander, 1983). The final loop library used by the LINKER program contains only sequences that are consistent with both selection criteria. The extracted loop sequences were then grouped according to their sequence lengths. Loops of fewer than four residues were removed from the library, as were redundant loop sequences. It should be noted that while most 'non-redundant' structural libraries exclude structures of homologous sequences (Lessel and Schomburg, 1997), we define 'non-redundant' as non-identical loop sequences, since sequence diversity amongst homologous proteins most likely occurs in the loop regions. For the purpose of generating sequences with extended conformations, it is necessary to include loop sequences of homologous proteins.
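The library construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LINKER source: the function name `build_loop_library` and its two inputs are hypothetical, and agreement between the author-annotated loops and the DSSP-derived loops is approximated here by identical sequence strings, whereas the actual comparison presumably also takes the position of each loop in its structure into account.

```python
from collections import defaultdict

MIN_LOOP_LENGTH = 4  # loops of fewer than four residues are dropped


def build_loop_library(author_loops, dssp_loops):
    """Group loop sequences by length.

    author_loops, dssp_loops: iterables of one-letter amino acid strings
    (hypothetical inputs, one per extraction method).
    Returns {length: sorted list of unique sequences}.
    """
    # Keep only sequences found by both methods. Matching on the sequence
    # string alone is an assumption made for this sketch.
    consistent = set(author_loops) & set(dssp_loops)

    library = defaultdict(list)
    # Using a set above already removes identical ("redundant") sequences;
    # sequences from homologous proteins that differ in any residue are kept.
    for seq in sorted(consistent):
        if len(seq) >= MIN_LOOP_LENGTH:
            library[len(seq)].append(seq)
    return dict(library)


if __name__ == "__main__":
    author = ["GSGS", "GSGS", "PTG", "AGNPDE", "KKLT"]
    dssp = ["GSGS", "AGNPDE", "PTG", "TTTT"]
    print(build_loop_library(author, dssp))
```

Keying the library by length makes the later lookup of all loops of a requested length a single dictionary access.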
We also removed hairpin loops from the loop library, which reduced the number of loops in the library to 17 870. Hairpin loops are a result of intra-loop hydrogen bond interactions between main-chain atoms of the loop structure, usually between an amide nitrogen of a loop residue and a carbonyl oxygen (N–O distance) two or three residues C-terminal to that residue (Hutchinson and Thornton, 1994; Sibanda et al., 1994). Unlike compact loop structures, where intra-loop interactions are between side chains of the loop residues (Leszcynski and Rose, 1986), hairpin loops are geometrically well defined and are potentially sequence dependent (Hutchinson and Thornton, 1994). The rigid structure of hairpin loops could potentially limit the degree of freedom between domains of the fusion protein. The program excluded the hairpin loops from our library by inspecting the N(i)–O(i+n) distances within each loop structure, where n = 2, 3, 4, 5, 6. If any one of the N–O distances in a particular loop structure is shorter than 4.0 Å, that loop sequence is removed from the library.
The current loop library contains 14 734 loop sequences. Figure 1 shows the distribution of loop lengths in the library. A large proportion of the loop sequences are between 4 and 9 residues long. The medium loops (10–20 residues) constitute about 12% of the library, while the longer loops account for only a small fraction of the library.
The basic input to the program is the desired length of the linker sequence. The program accepts either the number of residues or a distance in angstroms, which is converted to a number of residues by assuming that the polypeptide is completely extended and that each amino acid spans 3.25 Å. Since loop structures are rarely found fully extended in proteins, we recommend that users add 10–15% to the desired linker length. Upon acceptance of the input parameter, the program searches through the loop library and writes the loop sequences of the specified length to an output file. A table of Eisenberg consensus hydrophobicity scale values is attached to each output file (Eisenberg, 1984). The user may plot a hydrophobicity profile for a selected linker sequence in the output file. Such a profile may help users to select linkers with appropriate sequence characteristics (Figure 2).
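As a worked illustration of the length conversion, the sketch below turns a distance in angstroms into a residue count using the 3.25 Å per residue figure given above, with an optional padding fraction for the recommended 10–15% allowance. The function names are hypothetical, and rounding up to the next whole residue is an assumption; the text does not state how the program rounds.

```python
import math

RESIDUE_LENGTH = 3.25  # Å per residue for a fully extended polypeptide


def residues_for_distance(distance_angstrom, padding=0.0):
    """Convert a linker length in Å to a residue count.

    padding is a fraction, e.g. 0.10 to 0.15 for the allowance recommended
    in the text. Rounding up is an assumption made for this sketch.
    """
    if distance_angstrom <= 0:
        raise ValueError("distance must be positive")
    if padding < 0:
        raise ValueError("padding must be non-negative")
    return math.ceil(distance_angstrom * (1.0 + padding) / RESIDUE_LENGTH)


def select_by_length(library, n_residues):
    """Return all loop sequences of the requested length.

    library is assumed to be a dict mapping length to a list of sequences.
    """
    return list(library.get(n_residues, []))


if __name__ == "__main__":
    n = residues_for_distance(30.0, padding=0.10)
    print(n, select_by_length({11: ["GSGSGSGSGSG"]}, n))
```

For example, a 30 Å gap corresponds to 30 / 3.25 ≈ 9.2 residues, which rounds up to 10; with a 10% allowance the target becomes 33 Å, or 11 residues.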
Program usage and output |
---|
Figure 3 shows a flow chart of an example in which LINKER selects linker sequences for constructing a DNA-binding protein. With various optional input parameters entered, LINKER generates eight possible linker sequences (see Figure 2). The optional input parameters are effective in reducing the number of output sequences. This feature is especially significant in situations where shorter linker sequences are sought. As mentioned earlier, the majority of sequences in the loop library are 5–10 residues long (Figure 1). We performed test runs for linker sequences of 5, 10, 15 and 20 residues. In each case, a set of input parameters was applied sequentially to monitor the reduction of output sequences. For five-residue sequences (Table II), the number of output sequences was decreased by 20% with the removal of sequences sensitive to thrombin; with the input of four endonucleases, the output was further reduced by 6.9%; the number of output sequences was reduced by another 69% after sequences containing charged residues had been removed. With the same sequential application of the optional input parameters, the output reductions are 37, 17 and 86% for 10-residue linker sequences (Table II). The greater effect of the optional input parameters for longer linker sequences reflects their inherent greater sequence variability.
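The sequential application of optional filters, with the reduction recorded at each step, might be sketched as follows. Everything here is illustrative: the names `apply_filters` and `has_no_charged_residues` are hypothetical, the charged set is assumed to be D, E, K and R, the 'PR' motif in the usage example is an arbitrary stand-in rather than the program's protease rule, and each percentage is assumed to be relative to the number of sequences remaining after the previous step.

```python
CHARGED = frozenset("DEKR")  # assumed definition of charged residues


def has_no_charged_residues(seq):
    """True if the sequence contains none of the assumed charged residues."""
    return CHARGED.isdisjoint(seq)


def apply_filters(sequences, filters):
    """Apply (name, keep_predicate) filters in order.

    Returns the surviving sequences and a report of
    (name, number remaining, percent reduction versus the previous step).
    """
    remaining = list(sequences)
    report = []
    for name, keep in filters:
        before = len(remaining)
        remaining = [s for s in remaining if keep(s)]
        removed = before - len(remaining)
        reduction = 100.0 * removed / before if before else 0.0
        report.append((name, len(remaining), reduction))
    return remaining, report


if __name__ == "__main__":
    seqs = ["GSGSG", "GPRGS", "AEAAK", "SGTPA", "PLVPR"]
    filters = [
        # Hypothetical motif exclusion, used only to show the mechanism.
        ("no 'PR' motif", lambda s: "PR" not in s),
        ("no charged residues", has_no_charged_residues),
    ]
    kept, report = apply_filters(seqs, filters)
    for name, count, pct in report:
        print(f"{name}: {count} remaining ({pct:.1f}% reduction)")
```

Because each filter is just a predicate on the sequence, further criteria can be added to the list without changing `apply_filters`.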
Availability |
---|
Notes |
---|
Acknowledgments |
---|
References |
---|
Barton,G.J. (1995) Curr. Opin. Struct. Biol., 5, 372–376.
Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F.,Jr, Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.
Bird,R.E., Hardman,K.D., Jacobson,J.W., Johnson,S., Kaufman,B.M., Lee,S.-M., Pope,H.S., Riodan,G.S. and Whitlow,M. (1988) Science, 242, 423–426.
Bulow,L. (1990) Biochem. Soc. Symp., 57, 123–133.
Bushman,F.D. and Miller,M.D. (1997) J. Virol., 71, 458–464.
Eisenberg,D. (1984) Annu. Rev. Biochem., 53, 595–623.
Forsberg,G., Samuelsson,E., Wadensten,H., Moks,T. and Hartmans,M. (1992) In Angeletti,R.H. (ed.), Techniques in Protein Chemistry. Academic Press, San Diego, pp. 329–336.
Goulaouic,H. and Chow,S.A. (1996) J. Virol., 70, 37–46.
Hutchinson,E.G. and Thornton,J.M. (1994) Protein Sci., 3, 2207–2216.
Jones,D.T. (1997) Curr. Opin. Struct. Biol., 7, 377–387.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Kim,J.-S. and Pabo,C.O. (1997) J. Biol. Chem., 272, 29795–29800.
Lessel,U. and Schomburg,D. (1997) Protein Engng, 10, 659–664.
Leszcynski,J.F. and Rose,G.D. (1986) Science, 234, 849–855.
Robinson,C.R. and Sauer,R.T. (1998) Proc. Natl Acad. Sci. USA, 95, 5929–5934.
Samuelsson,E., Wadensten,H., Hartmans,M., Moks,T. and Uhlen,M. (1991) Biotechnology, 9, 363–366.
Sibanda,B.L., Blundell,T.L. and Thornton,J.M. (1994) J. Mol. Biol., 206, 759–777.
Tang,L., Li,J., Katz,D.S. and Feng,J.-A. (2000) Biochem., 39, 3052–3060.
Received June 23, 1999; revised January 19, 2000; accepted February 20, 2000.