LINKER: a program to generate linker sequences for fusion proteins

Chiquito J. Crasto and Jin-an Feng1

Institute for Cancer Research, Fox Chase Cancer Center, 7701 Burholme Avenue, Philadelphia, PA 19111, USA


    Abstract
 
The construction of functional fusion proteins often requires a linker sequence that adopts an extended conformation to allow for maximal flexibility. Linker sequences are generally selected based on intuition. Without a reliable selection criterion, the design of such linkers is often difficult, particularly in situations where longer linker sequences are required. Here we describe a program called LINKER which can automatically generate a set of linker sequences that are known to adopt extended conformations as determined by X-ray crystallography and NMR. The only required input to the program is the desired linker sequence length. The program is specifically designed to assist in fusion protein construction. A number of optional input parameters have been incorporated so that users are able to enhance sequence selection based on specific applications. The program output simply contains a set of sequences with a specified length. This program should be a useful tool in both the biotechnology industry and biomedical research. It can be accessed through the Web page http://www.fccc.edu/research/labs/feng/linker.html.

Keywords: fusion protein/linker sequence/loop library/program/proteins


    Introduction
 
The gene fusion technique has become an increasingly useful tool in a variety of fields of biomedical research. In structural biology, the construction of recombinant fusion proteins has been used as a means to increase the expression of soluble proteins and to facilitate protein purification (Altman et al., 1991; Samuelsson et al., 1991; Forsberg et al., 1992). The technique has been used to study the functional activity of proteins in in vitro assays (Goulaouic and Chow, 1996; Bushman and Miller, 1997). In recent years, a wide range of applications of the gene fusion technique have been reported in the field of biotechnology. These applications include the selection and production of antibodies (Bird et al., 1988) and the engineering of bifunctional enzymes (Bulow, 1990) and proteins with specialized functions, such as proteins that target specific genes (Kim and Pabo, 1997; Tang et al., 2000). The gene fusion technique is also expected to have extensive application in the field of structure-based protein engineering when our understanding of functional domains of proteins is improved with the increasing number of protein structures determined by X-ray crystallography and NMR spectroscopy.

The construction of a fusion protein involves the linking of two macromolecules by a linker sequence. The macromolecules involved usually include proteins and globular domains of proteins. The selection of the linker sequence is particularly important in the construction of functional fusion proteins. In addition to the necessity for an appropriate amino acid composition, the overall folding of the linker must be taken into consideration. Robinson and Sauer (1998) found that the linker sequence composition could have a significant effect on the folding stability of a fusion protein. It is also unfavorable to have a linker sequence with a high propensity for forming α-helical or β-strand structures, because these would limit the flexibility of the fusion protein and consequently affect its functional activity. Therefore, the design of a linker sequence often requires careful consideration in order to avoid such secondary structural elements. Unfortunately, there are no reliable selection criteria available for use in linker design. Most current linker design selection processes are still largely dependent on intuition. Although significant progress has been made in predicting secondary structures of proteins based on primary sequences (Barton, 1995; Jones, 1997), our understanding of sequence–structure correlation is still limited. On average, current algorithms can produce a prediction accuracy of about 72%, with higher reliability in helix structure predictions and a lower level of confidence in β-strand and loop region predictions (Barton, 1995). Such a process of selection by intuition often leaves great uncertainty, particularly in the case of longer linker sequence selections.

We have developed a computer program, LINKER, that automatically generates a set of linker sequences according to the input parameters. This program is based on the assumption that the observed loop sequences in the X-ray crystal structures or the NMR solution structures are likely to adopt an extended conformation as linkers in the fusion protein. The program searches a loop library derived from the Brookhaven Protein DataBank (PDB) (Bernstein et al., 1977). The basic input to the program is the desired linker sequence length. The output of the program is a set of amino acid sequences of specified length. We have also incorporated some optional input parameters specifically designed to help users to select linker sequences that would fit in with their particular research needs.


    Materials and methods
 
Loop library construction

Our loop library is derived from the Brookhaven Protein DataBank (PDB) (September 1999 release). Of 9600 PDB files, 27 554 loop sequences of various lengths were extracted. In order to ensure that all loops in the loop library are structurally observed, loop sequences were first extracted according to authors' records in the PDB files. Each sequence was then compared with a second loop library generated by the DSSP program, which automatically identifies secondary and loop structures of proteins based on predefined backbone dihedral angles, i.e. the φ, ψ values and hydrogen bonding patterns (Kabsch and Sander, 1983). The final loop library used by the LINKER program contains sequences that are consistent with both selection criteria. The extracted loop sequences were then grouped according to their sequence lengths. Loops of fewer than four residues were removed from the library. The redundant loop sequences were also removed from the library. It should be noted that while most 'non-redundant' structural libraries exclude structures of homologous sequences (Lessel and Schomburg, 1997), we define 'non-redundant' as non-identical loop sequences, since sequence diversity amongst homologous proteins most likely occurs in the loop regions. For the purpose of generating sequences with extended conformations, it is necessary to include loop sequences of homologous proteins.
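The construction steps described above (consensus of two loop sets, a minimum length of four residues, removal of identical sequences and grouping by length) can be sketched as follows. This is only an illustrative sketch: the function name `build_loop_library` and its two input lists are hypothetical, and treating "consistent with both selection criteria" as an exact sequence match between the two sets is an assumption, not a detail given in the text.

```python
from collections import defaultdict

MIN_LOOP_LENGTH = 4  # loops of fewer than four residues are dropped (from the text)


def build_loop_library(author_loops, dssp_loops, min_length=MIN_LOOP_LENGTH):
    """Return {length: sorted list of unique loop sequences}.

    author_loops and dssp_loops are iterables of one-letter sequences;
    they are hypothetical stand-ins for the two parsed loop sets.
    """
    # Keep loops found by both criteria; using sets also removes duplicates.
    # Assumption: "consistent" means an identical sequence in both sets.
    consensus = set(author_loops) & set(dssp_loops)
    library = defaultdict(list)
    for seq in sorted(consensus):
        if len(seq) >= min_length:
            library[len(seq)].append(seq)
    return dict(library)


author = ["GSGS", "GSGS", "PNG", "TPEDGS"]
dssp = ["GSGS", "TPEDGS", "PNG", "AAAAK"]
print(build_loop_library(author, dssp))  # {4: ['GSGS'], 6: ['TPEDGS']}
```

Grouping by length up front means that a later request for linkers of a given length becomes a single dictionary lookup.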

We also removed hairpin loops from the loop library, which reduced the number of loops in the library to 17 870. Hairpin loops are a result of intra-loop hydrogen bond interactions between main-chain atoms of the loop structure, usually between an amide nitrogen of a loop residue and a carbonyl oxygen (N–O distance) two or three residues C-terminal to the residue (Hutchinson and Thornton, 1994; Sibanda et al., 1994). Unlike compact loop structures, where intra-loop interactions are between side chains of the loop residues (Leszcynski and Rose, 1986), hairpin loops are geometrically well defined and are potentially sequence dependent (Hutchinson and Thornton, 1994). The rigid structure of hairpin loops could potentially limit the degree of freedom between domains of the fusion protein. The program excluded the hairpin loops from our library by inspecting the N(i)–O(i+n) distances within each loop structure, where i is the residue index and n = 2, 3, 4, 5, 6. If any one of these N–O distances in a particular loop structure is shorter than 4.0 Å, that loop sequence is removed from the library.
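The hairpin test described above reduces to a distance check over pairs of backbone atoms. The sketch below assumes the amide N and carbonyl O coordinates of each loop residue have already been read from the structure file; the function name `is_hairpin` and the input layout are hypothetical, while the 4.0 Å cutoff and the offsets n = 2 to 6 are taken from the text.

```python
import math

HAIRPIN_CUTOFF = 4.0       # angstroms, from the text
OFFSETS = (2, 3, 4, 5, 6)  # values of n, from the text


def is_hairpin(n_coords, o_coords, cutoff=HAIRPIN_CUTOFF, offsets=OFFSETS):
    """Return True if any N(i)-O(i+n) distance is shorter than the cutoff.

    n_coords[i] and o_coords[i] are (x, y, z) tuples for the backbone
    amide N and carbonyl O of loop residue i (hypothetical input layout).
    """
    length = len(n_coords)
    for i in range(length):
        for n in offsets:
            j = i + n
            if j >= length:
                break  # offsets are ascending, so larger n is also out of range
            if math.dist(n_coords[i], o_coords[j]) < cutoff:
                return True
    return False


# Idealized straight chain: residues spaced 3.25 angstroms apart along x.
extended_n = [(3.25 * i, 0.0, 0.0) for i in range(6)]
extended_o = [(3.25 * i, 1.2, 0.0) for i in range(6)]
print(is_hairpin(extended_n, extended_o))  # False
```

A loop for which `is_hairpin` returns True would be dropped from the library; all others are kept.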

The current loop library contains 14 734 loop sequences. Figure 1 shows the distribution of loop lengths in the library. A large proportion of the loop sequences are between 4 and 9 residues long. The medium loops (10–20 residues) constitute about 12% of the library, while longer loops make up only a small fraction of the library.



 
Fig. 1. Histogram showing the distribution of loop sizes of the loop library for the LINKER program.

 
Design of the LINKER program

The basic input to the program is the desired length of the linker sequence. The program accepts either the number of residues or a distance in angstroms. A distance is converted to a number of residues by assuming that the polypeptide is completely extended and that each amino acid spans 3.25 Å. Since loop structures are rarely found fully extended in proteins, we recommend that users add 10–15% to the desired linker length. Upon acceptance of the input parameter, the program searches through the loop library and writes loop sequences of the specified length to an output file. A table of Eisenberg consensus hydrophobicity scale values is attached to each output file (Eisenberg, 1984). The user may plot a hydrophobicity profile for any linker sequence in the output file; such a profile may help users to select linkers with appropriate sequence characteristics (Figure 2).
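The distance-to-residue conversion is simple arithmetic and can be illustrated with a short sketch. The 3.25 Å rise per residue and the 10–15% allowance come from the text; the function name `residues_for_distance`, the `padding` parameter and the choice to round up to a whole residue are assumptions made for this example.

```python
import math

RISE_PER_RESIDUE = 3.25  # angstroms per residue, fully extended (from the text)


def residues_for_distance(distance_angstrom, padding=0.0):
    """Convert a span in angstroms to a whole number of residues.

    padding is the fractional allowance (0.10 to 0.15 is recommended in
    the text) for loops that are not fully extended. Rounding up is an
    assumption of this sketch.
    """
    if distance_angstrom <= 0:
        raise ValueError("distance must be positive")
    padded = distance_angstrom * (1.0 + padding)
    return math.ceil(padded / RISE_PER_RESIDUE)


print(residues_for_distance(39.0))                # 12
print(residues_for_distance(39.0, padding=0.15))  # 14
```

For a 39 Å gap, the fully extended estimate is 12 residues, and the 15% allowance raises the request to 14 residues.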



 
Fig. 2. Hydrophobicity profile of an output sequence from a 15-residue linker sequence search. The Eisenberg consensus hydrophobicity scale is used here.
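A hydrophobicity profile of the kind shown in Figure 2 can be computed by looking up each residue in a scale and, optionally, averaging over a sliding window. The sketch below is generic: the function name `hydrophobicity_profile` and the window averaging are assumptions, and the scale is passed in as a dictionary so that the table attached to the program output can be supplied directly. The `toy_scale` values are placeholders for illustration and are not Eisenberg's values.

```python
def hydrophobicity_profile(sequence, scale, window=1):
    """Return the mean scale value over each window along the sequence.

    scale maps one-letter residue codes to hydrophobicity values.
    With window=1 the result is the per-residue value.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [scale[residue] for residue in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Placeholder numbers only; substitute the scale supplied with the output file.
toy_scale = {"G": 0.5, "S": -0.25, "L": 1.0}
print(hydrophobicity_profile("GSL", toy_scale))            # [0.5, -0.25, 1.0]
print(hydrophobicity_profile("GSL", toy_scale, window=2))  # [0.125, 0.375]
```

Plotting the returned list against residue position gives the profile.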

 
Since the program is specifically developed to help users to select linker sequences, we incorporated a few subroutines that allow users to input additional parameters in order to refine the output. One issue of concern in selecting linker sequences is how to identify potential proteolytic sensitive sites. In this program, we incorporated the proteolytic sites of six of the most commonly used proteases: trypsin, chymotrypsin, thrombin, plasmin, papain and factor Xa (Table I). By entering the names of the proteases, users can eliminate those linker sequences containing the corresponding proteolytic sites. Designing a fusion gene often requires careful consideration of the gene sequence so that the entire fusion gene does not contain sites that are sensitive to the restriction endonucleases required to create fusion sites. This program allows the user to input the names of the restriction enzymes to be used in gene construction. The generated linker sequences that require coding genes containing sensitive sites for the corresponding enzymes are labeled with an asterisk. Should the user prefer to choose these sequences, an alternative set of codons ought to be considered. In the current version of the program, only some of the most common restriction endonuclease sites are incorporated. A complete list of these enzymes can be found on the program web page. Users may ask the program to remove linker sequences that contain unfavorable residue types for their specific applications. For example, in constructing a DNA binding fusion protein, it is perhaps undesirable to include highly charged residues, such as lysine and arginine, in the linker sequence since they may form salt bridges with the phosphate backbone of the DNA, thus influencing the DNA binding property of the engineered protein. Alternatively, the program also allows the user to select linker sequences that contain amino acids of their choice.
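The protease and residue-type filters described above amount to pattern tests on each candidate sequence. The following sketch shows the idea with deliberately simplified cleavage rules (trypsin after Lys or Arg, chymotrypsin after Phe, Trp or Tyr, factor Xa at Ile-Glu-Gly-Arg). These motifs are textbook approximations chosen for illustration and are not the contents of Table I; the names `PROTEASE_MOTIFS` and `filter_linkers` are hypothetical.

```python
import re

# Simplified cleavage motifs for illustration only; NOT the program's Table I.
PROTEASE_MOTIFS = {
    "trypsin": re.compile(r"[KR]"),        # cleaves after Lys or Arg
    "chymotrypsin": re.compile(r"[FWY]"),  # cleaves after Phe, Trp or Tyr
    "factor xa": re.compile(r"IEGR"),      # canonical Ile-Glu-Gly-Arg site
}


def filter_linkers(sequences, proteases=(), excluded_residues=""):
    """Drop sequences with a listed protease motif or an excluded residue."""
    excluded = set(excluded_residues)
    kept = []
    for seq in sequences:
        if excluded & set(seq):
            continue  # contains an unwanted residue type
        if any(PROTEASE_MOTIFS[name.lower()].search(seq) for name in proteases):
            continue  # contains a site for one of the named proteases
        kept.append(seq)
    return kept


candidates = ["GSGSG", "GKGSG", "GSFSG", "GDGSG"]
print(filter_linkers(candidates, proteases=["trypsin"]))
# ['GSGSG', 'GSFSG', 'GDGSG']
print(filter_linkers(candidates, proteases=["chymotrypsin"], excluded_residues="KRDE"))
# ['GSGSG']
```

The second call mirrors the DNA binding example in the text: charged residues (Lys, Arg, Asp, Glu) are excluded outright, independently of any protease selection.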


 
Table I. List of proteolytic sensitive sites incorporated in the LINKER program
 

    Program usage and output
 
Depending on the input parameters, the program generates a list of linker sequences in an output file. To demonstrate the capabilities of the LINKER program, we present an example of the program operation. The test case requests linker sequences 15 amino acids long. As shown in Figure 3, the program generates 113 linker sequences that are 15 residues long. When optional parameters are included, the number of linker sequences generated is reduced considerably:



 
Fig. 3. Flow diagram showing the operation of the LINKER program generating 15-residue linker sequences based on different input parameters. The example shown here is for the selection of linker sequences of a DNA binding fusion protein. Sequences with charged residues are removed from the output list so that unfavorable interactions between the DNA backbone and the linker sequence are minimized.

 
  1. Protease-sensitive site: of 113 15-residue linker sequences, 89 are sensitive to chymotrypsin, 80 are sensitive to trypsin, papain and plasmin and 43 are sensitive to thrombin.
  2. Restriction endonuclease cutting site: of the genes encoding 113 15-residue polypeptides, 13 contain sequences sensitive to BamHI cutting, one contains an EcoRI site, five contain EcoRV sites and three contain HindIII sites.
  3. Amino acid composition preference: by specifying residue type preference, one can significantly limit the output linker sequences. By removing sequences containing lysine, arginine, aspartic acid and glutamic acid, the output sequences can be reduced to just five. Alternatively, removing sequences containing bulky hydrophobic residues, Phe, Trp, Leu and Ile, reduces the output to eight sequences.
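The restriction-site check in item 2 depends on the codons chosen to encode each linker, which is why an alternative codon set can rescue a flagged sequence. The sketch below back-translates a peptide with one codon per residue and searches the resulting DNA for recognition sites. The four recognition sequences are the standard ones for these enzymes; the function names, the `codon_choice` mapping and the one-codon-per-residue back-translation are assumptions of this example and do not describe how the program itself assigns codons.

```python
# Standard recognition sequences; all four are palindromic, so searching
# one strand is sufficient.
RESTRICTION_SITES = {
    "BamHI": "GGATCC",
    "EcoRI": "GAATTC",
    "EcoRV": "GATATC",
    "HindIII": "AAGCTT",
}


def sites_in_coding_sequence(dna, enzymes):
    """Return the names of the enzymes whose site occurs in dna."""
    dna = dna.upper()
    return [name for name in enzymes if RESTRICTION_SITES[name] in dna]


def flag_linker(peptide, codon_choice, enzymes):
    """Back-translate with one codon per residue and mark hits with '*'.

    codon_choice maps each residue to a single codon (hypothetical input).
    """
    dna = "".join(codon_choice[residue] for residue in peptide)
    hits = sites_in_coding_sequence(dna, enzymes)
    return peptide + ("*" if hits else ""), hits


# Gly-Ser encoded as GGA TCC creates a BamHI site; GGT TCT does not.
print(flag_linker("GSGS", {"G": "GGA", "S": "TCC"}, ["BamHI", "EcoRI"]))
# ('GSGS*', ['BamHI'])
print(flag_linker("GSGS", {"G": "GGT", "S": "TCT"}, ["BamHI", "EcoRI"]))
# ('GSGS', [])
```

This sketch inspects only the linker's own coding sequence; sites that span the junctions with the flanking domains would need the neighboring codons as well.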

Figure 3 shows a flow chart of LINKER selecting linker sequences for the construction of a DNA binding protein. By entering various optional input parameters, LINKER generates eight possible linker sequences (see Figure 2). The optional input parameters are effective in reducing the number of output sequences. This feature is especially significant in situations where shorter linker sequences are sought. As mentioned earlier, most sequences in the loop library are 5–10 residues long (Figure 1). We performed test runs for linker sequences of 5, 10, 15 and 20 residues. In each case, a set of input parameters was applied sequentially to monitor the reduction of output sequences. For five-residue sequences (Table II), the number of output sequences was decreased by 20% with the removal of sequences sensitive to thrombin; with the input of four endonucleases, the output was further reduced by 6.9%; the number of output sequences was reduced by another 69% after sequences containing charged residues had been removed. With the same sequential application of the optional input parameters, the output reductions are 37, 17 and 86% for 10-residue linker sequences (Table II). The greater effect of the optional input parameters for longer linker sequences reflects their inherent greater sequence variability.
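The sequential reductions reported above can be reproduced for any candidate list by applying the filters in order and recording the percentage removed at each stage. Because the three values quoted for 10-residue linkers sum to more than 100%, each percentage is presumably relative to the number of sequences entering that stage, and the sketch below makes that assumption. The function name `sequential_reductions` and the example sequences are hypothetical and are not program output.

```python
def sequential_reductions(sequences, filters):
    """Apply filters in order; return (label, survivors, percent removed).

    filters is a list of (label, predicate) pairs; a sequence is kept when
    predicate(sequence) is True. Each percentage is relative to the number
    of sequences entering that stage (an assumption of this sketch).
    """
    report = []
    current = list(sequences)
    for label, keep in filters:
        survivors = [seq for seq in current if keep(seq)]
        removed = len(current) - len(survivors)
        percent = 100.0 * removed / len(current) if current else 0.0
        report.append((label, len(survivors), percent))
        current = survivors
    return report


# Made-up five-residue candidates for illustration.
candidates = ["GSGSG", "GKGSG", "GSGDG", "GSGSA", "GSGSP"]
stages = [
    ("no Lys", lambda seq: "K" not in seq),
    ("no Asp", lambda seq: "D" not in seq),
]
for label, count, percent in sequential_reductions(candidates, stages):
    print(f"{label}: {count} left, {percent:.1f}% removed")
# no Lys: 4 left, 20.0% removed
# no Asp: 3 left, 25.0% removed
```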


 
Table II. Percentage reduction in output sequences with sequential input parameters
 
LINKER is a convenient tool to generate linker sequences for the construction of fusion proteins. The sequences suggested by the program are those of structurally observed loops and therefore are likely to adopt an extended conformation in the engineered fusion protein. Furthermore, the program allows the user to select sequences with chemical features relevant to their interests. To our knowledge, this is the only program specifically designed as a tool to aid linker sequence selections. It should be particularly useful in the field of biotechnology and for researchers who are developing functional fusion proteins.


    Availability
 
The current version of the program is written in Fortran with a CGI interface and is compiled on an IRIX-based UNIX workstation. To facilitate access, we have set up a Web-based server. Detailed instructions on how to use the program are provided at our Web site http://www.fccc.edu/research/labs/feng/linker.html. The submission page includes all input parameters described in this paper. Upon entering the desired linker sequence length, users may select optional input parameters listed on the page to refine their output sequences. An output file containing suggested linker sequences is returned automatically after submission of the request.


    Notes
 
1 To whom correspondence should be addressed. E-mail: feng@guanyin.fccc.edu


    Acknowledgments
 
We thank Roland Dunbrack and Mike Sauder for providing access to their PDB database. We also thank members of the research computing facility at Fox Chase Cancer Center for advice on Perl scripting. C.J.C. is supported by a Plain & Fancy fellowship. This research was supported in part by a grant from the National Institutes of Health, GM54630 (J.F.), and an appropriation from the Commonwealth of Pennsylvania.


    References
 
Altman,J.D., Henner,D., Nilsson,B., Anderson,S. and Kuntz,I.D. (1991) Protein Engng, 4, 593–600.

Barton,G.J. (1995) Curr. Opin. Struct. Biol., 5, 372–376.

Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F.,Jr, Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.

Bird,R.E., Hardman,K.D., Jacobson,J.W., Johnson,S., Kaufman,B.M., Lee,S.-M., Pope,H.S., Riodan,G.S. and Whitlow,M. (1988) Science, 242, 423–426.

Bulow,L. (1990) Biochem. Soc. Symp., 57, 123–133.

Bushman,F.D. and Miller,M.D. (1997) J. Virol., 71, 458–464.

Eisenberg,D. (1984) Annu. Rev. Biochem., 53, 595–623.

Forsberg,G., Samuelsson,E., Wadensten,H., Moks,T. and Hartmans,M. (1992) In Angeletti,R.H. (ed.), Techniques in Protein Chemistry. Academic Press, San Diego, pp. 329–336.

Goulaouic,H. and Chow,S.A. (1996) J. Virol., 70, 37–46.

Hutchinson,E.G. and Thornton,J.M. (1994) Protein Sci., 3, 2207–2216.

Jones,D.T. (1997) Curr. Opin. Struct. Biol., 7, 377–387.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Kim,J.-S. and Pabo,C.O. (1997) J. Biol. Chem., 272, 29795–29800.

Lessel,U. and Schomburg,D. (1997) Protein Engng, 10, 659–664.

Leszcynski,J.F. and Rose,G.D. (1986) Science, 234, 849–855.

Robinson,C.R. and Sauer,R.T. (1998) Proc. Natl Acad. Sci. USA, 95, 5929–5934.

Samuelsson,E., Wadensten,H., Hartmans,M., Moks,T. and Uhlen,M. (1991) Biotechnology, 9, 363–366.

Sibanda,B.L., Blundell,T.L. and Thornton,J.M. (1994) J. Mol. Biol., 206, 759–777.

Tang,L., Li,J., Katz,D.S. and Feng,J.-A. (2000) Biochem., 39, 3052–3060.

Received June 23, 1999; revised January 19, 2000; accepted February 20, 2000.