Physicochemical factors for discriminating between soluble and membrane proteins: hydrophobicity of helical segments and protein length

Shigeki Mitaku1 and Takatsugu Hirokawa

Tokyo University of Agriculture and Technology, Department of Biotechnology, Koganei, Tokyo 184-8588, Japan


    Abstract
 
The average hydrophobicity of a polypeptide segment is considered to be the most important factor in the formation of transmembrane helices, and the partitioning of the most hydrophobic (MH) segment into one of two alternative nonpolar environments, a membrane or the hydrophobic core of a globular protein, may determine the type of protein produced. In order to elucidate the importance of the MH segment in determining which of the two types of protein results from a given amino acid sequence, we statistically studied the characteristics of MH helices longer than 19 residues in 97 membrane proteins whose three-dimensional structure or topology is known, as well as in 397 soluble proteins selected from the Protein Data Bank. The average hydrophobicity of MH helices in membrane proteins had a characteristic relationship with the length of the protein. All MH helices in membrane proteins longer than 500 residues had a hydrophobicity greater than 1.75 (Kyte and Doolittle scale), while the MH helices in membrane proteins shorter than 100 residues could be as hydrophilic as 0.1. The possibility of developing a method to discriminate membrane proteins from soluble ones, based on the effect of size on the type of protein produced, is discussed.

Keywords: hydrophobicity/length of protein/membrane protein/protein folding/transmembrane helix


    Introduction
 
Recent developments in the human genome project have increased the need for more information about the structure and function of proteins in total proteomes. However, many amino acid sequences in total proteomes show no sequence homology with other known proteins. For such proteins, the information has to be extracted from the sequences alone, which has proved difficult. A realistic approach to the problem of computational analysis of total proteomes is to classify amino acid sequences into several categories related to their function.

The location of proteins in the cell is closely related to their function, and the classification of proteins by their location is a prerequisite for more detailed structure prediction. The location of proteins in a cell is determined according to two processes during the early stage of protein biosynthesis. If an amino acid sequence has a signal peptide, the protein is translocated through the membrane, and transmembrane segments anchor the protein to the membrane. The present study focused on the problem of how to discriminate between membrane and soluble proteins. Specifically, our objective was to determine which factors are essential for targeting the most hydrophobic (MH) segment to one of two alternative nonpolar environments: the hydrocarbon region of a membrane or the hydrophobic core of a soluble protein.

The importance of MH helices for the problem of discrimination between soluble and membrane proteins was first addressed by Klein et al. (1985), using the average hydrophobicity of MH helices together with some statistical parameters characterizing transmembrane helices. The accuracy of discrimination by this method was as high as 95%. However, better accuracy is necessary for the analysis of a total proteome, since membrane proteins make up only a minor fraction of a proteome (Arkin et al., 1997; Frishman and Mewes, 1997; Wallin and von Heijne, 1998). Even when soluble proteins are predicted with an error of only 5%, the fraction of false positives among the predicted membrane proteins becomes much larger than this value, because a total proteome contains far fewer membrane proteins than soluble proteins.
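The base-rate argument above can be made concrete with a short calculation. The following is a minimal sketch, not part of the original analysis: the function name is ours, and the membrane-protein fraction of 20% is an illustrative assumption rather than a value taken from this work.

```python
def positive_predictive_value(sensitivity, specificity, membrane_fraction):
    """Fraction of predicted membrane proteins that really are membrane proteins."""
    true_positives = sensitivity * membrane_fraction
    false_positives = (1.0 - specificity) * (1.0 - membrane_fraction)
    return true_positives / (true_positives + false_positives)


# Illustrative inputs (assumptions): 95% of each class is predicted correctly,
# and membrane proteins make up 20% of the proteome.
ppv = positive_predictive_value(sensitivity=0.95, specificity=0.95,
                                membrane_fraction=0.20)
false_share = 1.0 - ppv

print(f"False positives among predicted membrane proteins: {false_share:.1%}")
```

Under these assumed inputs, roughly 17% of the proteins predicted as membrane proteins would in fact be soluble, which is more than three times the 5% per-class error.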

Many previous methods of transmembrane helix prediction, including the method of Klein et al. (1985), implicitly assumed that transmembrane helices are influenced only by short-range characteristics, such as hydrophobicity (Kyte and Doolittle, 1982; Steitz et al., 1982), helical periodicity (Eisenberg et al., 1984; Mitaku et al., 1984, 1985; Rees et al., 1989; Jähnig, 1990), propensity (Jones et al., 1994), positive charge (von Heijne, 1989) and alignment of amino acid sequences (Persson and Argos, 1994). The hydropathy plot, for example, uses a hydropathy index averaged over several residues, which means that the interaction within a region of only several residues is assumed to account for the stabilization of transmembrane helices. However, the hydrophobic core of a soluble protein is formed by many other parts of the same protein, suggesting the importance of protein size for targeting MH helices.

In the present work, we determined whether the length of the protein affects discrimination between soluble and membrane proteins. The results clearly indicate that in addition to the average hydrophobicity of MH helices, protein size was an important factor in determining protein type.


    Materials and methods
 
The amino acid sequences of 397 soluble proteins and 97 membrane proteins were used for comparing sequence characteristics of soluble and membrane proteins. Amino acid sequences of soluble proteins were selected from 901 PDB_SELECT entries, which are based on a 25% sequence identity cut-off (Hobohm et al., 1992; Hobohm and Sander, 1994). First, membrane proteins were excluded from all entries, which reduced the number of chains from 901 to 892. Second, proteins for which NMR structures are available were omitted, reducing the number of entries to 745. Third, we eliminated entries that have unsuitable regions, namely those lacking coordinates or having non-sequential residues, reducing the number further to 663. Finally, we selected proteins having helices or merge helices longer than 19 residues. A merge helix is defined as a pair of helices which are linked by a segment shorter than six residues. We added merge helices to the dataset of long helices, because a long hydrophobic helix may be bent within a soluble protein by the aqueous environment. The number of soluble proteins was thereby reduced to 397 by the elimination of proteins which do not have such helical segments. The selected soluble proteins contained 489 real helices longer than 19 residues and 745 merge helices, as shown in Table I.
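The definition of a merge helix can be sketched in a few lines of code. This is an illustration under stated assumptions, not the procedure actually used: helices are represented as hypothetical (start, end) residue numbers with inclusive ends, only consecutive pairs are merged, the length of a merge helix is taken to include the linker, and a linker of zero residues is allowed.

```python
MIN_HELIX_LENGTH = 20   # "longer than 19 residues"
MAX_LINKER_LENGTH = 5   # linker "shorter than six residues"


def helix_length(helix):
    """Number of residues in a helix given as (start, end), ends inclusive."""
    start, end = helix
    return end - start + 1


def long_real_helices(helices):
    """Real helices that are themselves longer than 19 residues."""
    return [h for h in helices if helix_length(h) >= MIN_HELIX_LENGTH]


def merge_helices(helices):
    """Spans of consecutive helix pairs joined by a short linker.

    Assumption: the merged span runs from the start of the first helix
    to the end of the second, so the linker counts toward its length.
    """
    ordered = sorted(helices)
    merged = []
    for (start1, end1), (start2, end2) in zip(ordered, ordered[1:]):
        linker = start2 - end1 - 1
        if 0 <= linker <= MAX_LINKER_LENGTH:
            span = (start1, end2)
            if helix_length(span) >= MIN_HELIX_LENGTH:
                merged.append(span)
    return merged


# Hypothetical helix assignments for one protein chain.
helices = [(5, 16), (20, 31), (60, 85)]
print(long_real_helices(helices))  # the single long real helix
print(merge_helices(helices))      # the pair joined by a 3-residue linker
```

In this made-up example the first two helices are each too short to qualify, but together with their three-residue linker they form one merge helix of 27 residues.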


 
Table I. Number of proteins and helices for three ranges of protein size
 
We also used two datasets of amino acid sequences from membrane proteins: (i) 16 membrane proteins with 3D structures from PDB, comprising the photosynthetic reaction center 1prc (H, L and M) (Deisenhofer et al., 1985), bacteriorhodopsin 2brd (Grigorieff et al., 1996), cytochrome c oxidase 1occ (A, B, C, D, G, I, J, K, L and M) (Tsukihara et al., 1995) and the light-harvesting complex 1kzu (A and B) (Prince et al., 1997) and (ii) 80 membrane proteins whose topology is given by Fariselli and Casadio (1996). Membrane proteins were selected by the same protocols as soluble proteins; the sequence identity cut-off was 25% and proteins for which NMR data are available were omitted. For example, a photoreaction center, 1pcr (Allen et al., 1986), was omitted because of its excessive sequence identity with 1prc (Deisenhofer et al., 1985), and melittin 2mltA (Terwilliger and Eisenberg, 1982) was not used because it was too short. The total number of transmembrane helices was 357. All PDB (397 soluble proteins and 16 membrane proteins) and SWISS-PROT entries (80 membrane proteins) are available online at http://biophys.bio.tuat.ac.jp/~hirokawa/html/.

As shown in Table I, the number of real helices in soluble proteins is comparable with that of transmembrane helices, particularly for proteins smaller than 100 residues and larger than 500 residues. Since the purpose of the present study was to compare the characteristics of helical segments in soluble and membrane proteins, a sufficiently large number of proteins and helical segments is necessary in order to reach any significant conclusions. Therefore, merge helices were added to the dataset of real helices in soluble proteins. The total number of helices was 1234 in soluble proteins, including 745 merge helices. The fraction of membrane proteins in each size range was 27% for L < 100, 17% for 100 <= L < 500 and 33% for L >= 500. Because the characteristics of amino acid sequences from signal peptides are different from those of true transmembrane helices, we only used the amino acid sequences from mature proteins, which do not contain signal peptides. Therefore, we define the length of a protein as the number of amino acids present in the mature protein.

The hydrophobicity of polypeptide segments was evaluated by the hydropathy index of Kyte and Doolittle (1982). The average hydrophobicity, ⟨H⟩, is calculated using the following equation:

$$\langle H \rangle = \frac{1}{n} \sum_{i \in \{\mathrm{segment}\}} H(i)$$

in which H(i) is the hydropathy index of the ith residue, {segment} indicates a set of sequence numbers in a helical segment, and n indicates the segment's length.
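As an illustration of this definition, the average hydrophobicity of a segment can be computed as in the minimal sketch below. The index values are those of the Kyte and Doolittle (1982) scale; the function name and the example segments are ours and purely illustrative.

```python
# Hydropathy index of Kyte and Doolittle (1982), one-letter amino acid codes.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def average_hydrophobicity(segment):
    """Mean hydropathy index over the n residues of a helical segment."""
    if not segment:
        raise ValueError("segment must contain at least one residue")
    return sum(KYTE_DOOLITTLE[residue] for residue in segment) / len(segment)


# Hypothetical 20-residue segments, chosen only to show the two extremes.
hydrophobic_segment = "LLIVALLAVIGLLVAFLLIA"
charged_segment = "KEELEKKAEELRKKAEELDK"

print(round(average_hydrophobicity(hydrophobic_segment), 2))
print(round(average_hydrophobicity(charged_segment), 2))
```

A segment rich in Leu, Ile and Val gives a strongly positive average, whereas a segment dominated by charged residues gives a negative one.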


    Results
 
The hydrophobicity of helices from soluble and membrane proteins was compared in order to elucidate the characteristics of both types of protein. The histograms in Figure 1 show the average hydrophobicity of helices in both soluble and membrane proteins. The peak of the hydrophobicity distribution for transmembrane helices was much higher than that of helices in soluble proteins. About 49% of the transmembrane helices were more hydrophobic than any helical segment in soluble proteins. The threshold value used for this discrimination was 1.75 [hydropathy scale of Kyte and Doolittle (1982)]. However, there was a wide overlapping region of the average hydrophobicity, from 0 to 1.75, in which more than 51% of the transmembrane helices coexisted with the helices of soluble proteins. If we set a threshold value of 1.75 for discrimination, about 50% of membrane proteins were predicted with certainty. When the threshold was lowered, more membrane proteins were predicted, but the accuracy of discrimination became worse.



 
Fig. 1. Histogram of the average hydrophobicity of helices in soluble (upper) and membrane (lower) proteins. The population of helices longer than 19 residues in soluble proteins and all transmembrane helices are plotted as a function of the average hydrophobicity. Solid and hatched bars represent the fraction of the MH helices in proteins. Real helices are represented by solid bars, and merge helices are shown by hatched bars. Open bars represent other real and merge helices.

 
However, we did not have to predict all transmembrane helices to discriminate membrane proteins from soluble ones, as pointed out by Klein et al. (1985). Correct prediction of only one transmembrane segment in a polypeptide was enough for this purpose. Solid bars in Figure 1 represent the most hydrophobic (MH) helices in all of the proteins. The fraction of transmembrane helices in the overlapping region decreased significantly when the MH helices were used instead of all transmembrane helices. About 76% of the MH transmembrane helices were above the threshold of 1.75. Nevertheless, roughly a quarter of the MH transmembrane helices coexisted with the helices of soluble proteins, suggesting that other factors are necessary for better discrimination.
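The use of the MH helix as a single discriminating feature can be sketched as follows. This is a simplified illustration, not the procedure used to produce the figures: the function names are ours, the per-helix average hydrophobicities are hypothetical inputs assumed to have been computed beforehand, and treating a value exactly at the threshold as a membrane protein is an assumption.

```python
THRESHOLD = 1.75  # Kyte and Doolittle scale, threshold quoted in the text


def most_hydrophobic(helix_scores):
    """Average hydrophobicity of the most hydrophobic (MH) helix of a protein."""
    return max(helix_scores)


def is_certain_membrane(helix_scores, threshold=THRESHOLD):
    """True if the MH helix alone places the protein above the threshold."""
    return most_hydrophobic(helix_scores) >= threshold


# Hypothetical per-helix average hydrophobicities for two proteins.
protein_a = [0.4, 1.2, 2.1]   # one helix above the threshold is enough
protein_b = [-0.3, 0.9]       # MH helix falls in the overlapping region

print(is_certain_membrane(protein_a))
print(is_certain_membrane(protein_b))
```

Only the maximum over the helices of a protein matters here, which is why a single strongly hydrophobic helix is sufficient, while a protein whose MH helix lies between 0 and 1.75 remains undecided by this criterion.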

In Figure 2, protein length is plotted as a function of the average hydrophobicity of MH transmembrane helices. A characteristic relationship was observed between the two parameters. The average hydrophobicity of MH transmembrane helices was higher than 1.75 for membrane proteins longer than 500 residues, whereas for a number of MH transmembrane helices in membrane proteins smaller than 100 residues, hydrophobicity values were as low as 0.1.



 
Fig. 2. The length of membrane proteins plotted as a function of the average hydrophobicity, ⟨H⟩, of MH transmembrane helices. Membrane proteins are classified into four categories by the average hydrophobicity of the MH helices: region I (⟨H⟩ < 0), region II (0 <= ⟨H⟩ < 1.0), region III (1.0 <= ⟨H⟩ < 1.75) and region IV (⟨H⟩ >= 1.75).

 
The histograms of protein length are shown in Figure 3a–d for proteins whose MH helices were located, in terms of the average hydrophobicity ⟨H⟩, in region I (⟨H⟩ < 0), region II (0 <= ⟨H⟩ < 1.0), region III (1.0 <= ⟨H⟩ < 1.75) and region IV (⟨H⟩ >= 1.75) of Figure 2, respectively. Regions II and III correspond to the halves of the overlapping region in Figure 1. Region I contained only soluble proteins, the peak of the size distribution being between 100 and 200 residues (Figure 3a), whereas region IV included only membrane proteins, and the peak of the size distribution was between 300 and 400 residues (Figure 3d). Although region II contained MH helices from both soluble and membrane proteins (Figure 3b), the proportion of soluble proteins smaller than 100 residues seemed relatively small and the proportion of membrane proteins in the same size region was large. Only region III contained a wide overlapping region.
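The assignment of proteins to the four regions can be written compactly. The sketch below uses the region boundaries given in the text; the function names and the example proteins are hypothetical and serve only to illustrate the grouping of protein lengths by region.

```python
from collections import defaultdict


def hydrophobicity_region(avg_h):
    """Region I-IV for the average hydrophobicity of an MH helix."""
    if avg_h < 0:
        return "I"
    if avg_h < 1.0:
        return "II"
    if avg_h < 1.75:
        return "III"
    return "IV"


def lengths_by_region(proteins):
    """Group protein lengths by the region of their MH helix.

    `proteins` is an iterable of (length, avg_h) pairs.
    """
    groups = defaultdict(list)
    for length, avg_h in proteins:
        groups[hydrophobicity_region(avg_h)].append(length)
    return dict(groups)


# Hypothetical (length, MH average hydrophobicity) pairs.
example = [(150, -0.2), (80, 0.4), (320, 1.3), (350, 2.2), (640, 1.9)]
print(lengths_by_region(example))
```

The length lists returned for each region are the raw material for length histograms of the kind described above.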






 
Fig. 3. Histograms of protein length for MH real helices (hatched bars), MH merge helices (open bars) in soluble proteins and MH transmembrane helices (solid bars) in regions I, II, III and IV of the average hydrophobicity ⟨H⟩, which are defined in Figure 2: (a) ⟨H⟩ < 0 (region I); (b) 0 <= ⟨H⟩ < 1.0 (region II); (c) 1.0 <= ⟨H⟩ < 1.75 (region III); (d) ⟨H⟩ >= 1.75 (region IV).

 
Figures 4a, b and c show histograms similar to Figure 1, in which the dataset is divided into three categories according to protein size, L: (a) L >= 500; (b) 100 <= L < 500; (c) L < 100. The hydrophobicity of MH helices in membrane proteins longer than 500 residues was higher than 1.5, and the separation between soluble and membrane proteins was complete (Figure 4a). As the size of the proteins decreased, the distribution of the MH helices shifted to lower hydrophobicity for both soluble and membrane proteins. The average hydrophobicity of MH helices in soluble proteins larger than 500 residues showed a peak at 0.5–0.75. The peak hydrophobicity of MH helices in soluble proteins of intermediate size (100 <= L < 500) was in the range between 0 and 0.25 (Figure 4b), whereas the peak for soluble proteins smaller than 100 residues was below zero (Figure 4c). If we used this dependence on protein size, the discrimination rate improved from 76 to ~90%, indicating that protein size is one of the essential factors for discrimination between soluble and membrane proteins.
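The size-dependent analysis amounts to binning proteins first by length and then by the average hydrophobicity of their MH helix. The following sketch shows one way to do this; it is an illustration only. The three size classes follow the text, whereas the bin width of 0.25, the function names and the example proteins are assumptions.

```python
import math
from collections import Counter

BIN_WIDTH = 0.25  # assumed histogram bin width on the hydropathy scale


def size_class(length):
    """Size category: (a) L >= 500, (b) 100 <= L < 500, (c) L < 100."""
    if length >= 500:
        return "a"
    if length >= 100:
        return "b"
    return "c"


def hydrophobicity_bin(avg_h, width=BIN_WIDTH):
    """(lower, upper) edges of the histogram bin containing avg_h."""
    lower = math.floor(avg_h / width) * width
    return (lower, lower + width)


def histograms_by_size(proteins):
    """Count proteins per (bin, protein type) within each size class.

    `proteins` is an iterable of (length, avg_h, kind) triples, where
    `kind` is "soluble" or "membrane".
    """
    histograms = {"a": Counter(), "b": Counter(), "c": Counter()}
    for length, avg_h, kind in proteins:
        histograms[size_class(length)][(hydrophobicity_bin(avg_h), kind)] += 1
    return histograms


# Hypothetical proteins: (length, MH average hydrophobicity, type).
example = [(620, 0.6, "soluble"), (650, 1.9, "membrane"), (80, -0.1, "soluble")]
for label, counts in histograms_by_size(example).items():
    print(label, dict(counts))
```

Comparing the soluble and membrane counts bin by bin within each size class shows where the two distributions separate and where they still overlap.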





 
Fig. 4. Histograms of the average hydrophobicity of MH helices for soluble proteins having MH real helices (hatched bars) or MH merge helices (open bars) and for membrane proteins (solid bars). In this case, all proteins are classified into three categories by the length, L, of the protein: (a) L >= 500; (b) 100 <= L < 500 and (c) L < 100.

 

    Discussion
 
The present results show that two parameters, protein size and the average hydrophobicity of the MH helix, may be used to develop a new method for the prediction of membrane proteins from amino acid sequences. The prediction of membrane proteins using the average hydrophobicity of MH helices was originally proposed by Klein et al. (1985) and the importance of this parameter was confirmed using a larger dataset in the present work. About 75% of membrane proteins could be distinguished from soluble proteins by this parameter. However, we identified a region of the average hydrophobicity, between 0 and 1.75, in which the two types of proteins coexist. Roughly 25% of membrane proteins were found in this region (Figure 1). Klein et al. (1985) also realized the need for other parameters to improve the accuracy of discrimination between soluble and membrane proteins, and they devised a statistical parameter characterizing transmembrane helical segments.

In the present work, however, a completely different parameter, protein length, was found to be essential for targeting MH helices to membrane or soluble proteins. The average hydrophobicity of MH helices in membrane proteins longer than 500 residues was higher than 1.5. Because the average hydrophobicity of MH helices of soluble proteins was lower than 1.25, the type of protein could be well distinguished for the dataset of proteins longer than 500 residues. On the other hand, although many membrane proteins shorter than 100 residues were found in the region between 0 and 1.75, the distribution of the hydrophobicity of helices in soluble proteins correspondingly shifted to the lower values. Thus, the discrimination between soluble and membrane proteins may be improved by using the relationship between protein length and the average hydrophobicity of the MH helices.

However, these two parameters are not enough for complete discrimination, as seen from the overlapping regions in Figures 3 and 4. This suggests that other factors must exist which stabilize MH helices in the membrane or in the hydrophobic core of soluble proteins. Recently, we made public a system for membrane protein discrimination and transmembrane helix prediction (SOSUI) (http://www.tuat.ac.jp/~mitaku/adv_sosui/) (Hirokawa et al., 1998), in which the two parameters discussed in this work were incorporated and the problems mentioned above were partly solved. The details of the discrimination algorithm in the SOSUI system will be described elsewhere.

The discrimination between soluble and membrane proteins is generally related to the problem of the early stage of protein folding. The present work showed that the length of a protein is correlated with the fate of its MH segment. When the MH segment has intermediate hydrophobicity, a short polypeptide tends to become a membrane protein, while a long polypeptide is folded to form a soluble protein. This correlation between protein size and type seems reasonable from the physicochemical viewpoint. A hydrophobic segment is energetically unfavorable in water and tries to find a nonpolar environment. However, a short polypeptide cannot make a sufficiently nonpolar environment to cover the hydrophobic segment. Therefore, a hydrophobic segment in a short protein tends to be partitioned into a membrane.

However, the penetration of a polypeptide into a membrane is mostly driven by the translocation machinery of the cell (Sakaguchi et al., 1992; Rapoport et al., 1996). Thus, the physicochemical consideration is not enough for understanding the correlation between protein size and type. The translocation machinery has to recognize transmembrane segments by some local sequence patterns, and the relationship between such sequence patterns and the length of membrane proteins is still unclear. More theoretical and experimental research is necessary to elucidate the physical mechanism of the size effect on the discrimination between soluble and membrane proteins.


    Acknowledgments
 
This work was partly supported by a Grant-in-Aid for basic research and priority area research of 'Genome Science' from Monbusho (Ministry of Education, Science, Sports and Culture) of Japan.


    Notes
 
1 To whom correspondence should be addressed; email: mitaku@cc.tuat.ac.jp


    References
 
Allen,J.P., Feher,G., Yeates,T.O., Rees,D.C., Deisenhofer,J., Michel,H. and Huber,R. (1986) Proc. Natl Acad. Sci. USA, 83, 8589–8593.

Arkin,I.T., Brunger,A.T. and Engelman,D.M. (1997) Proteins, 28, 465–466.

Deisenhofer,J., Epp,O., Miki,K., Huber,R. and Michel,H. (1985) Nature, 318, 618–624.

Eisenberg,D., Weiss,R.M. and Terwilliger,T.C. (1984) Proc. Natl Acad. Sci. USA, 81, 140–144.

Fariselli,P. and Casadio,R. (1996) CABIOS, 12, 41–48.

Frishman,D. and Mewes,H.W. (1997) Nature Struct. Biol., 4, 626–628.

Grigorieff,N., Ceska,T.A., Downing,K.H., Baldwin,J.M. and Henderson,R. (1996) J. Mol. Biol., 259, 393–421.

Hirokawa,T., Boon-Chieng,S. and Mitaku,S. (1998) Bioinformatics, 14, 378–379.

Hobohm,U. and Sander,C. (1994) Protein Sci., 3, 522–524.

Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.

Jähnig,F. (1990) TIBS, 15, 93–95.

Jones,D.T., Taylor,W.R. and Thornton,J.M. (1994) Biochemistry, 33, 3038–3049.

Klein,P., Kanehisa,M. and DeLisi,C. (1985) Biochim. Biophys. Acta, 815, 468–476.

Kyte,J. and Doolittle,R.F. (1982) J. Mol. Biol., 157, 105–132.

Mitaku,S., Hoshi,S., Abe,T. and Kataoka,R. (1984) J. Phys. Soc. Jpn, 53, 4083–4090.

Mitaku,S., Hoshi,S. and Kataoka,R. (1985) J. Phys. Soc. Jpn, 54, 2047–2054.

Persson,B. and Argos,P. (1994) J. Mol. Biol., 237, 182–192.

Prince,S.M., Papiz,M.Z., Freer,A.A., McDermott,G., Hawthornthwaite-Lawless,A.M., Cogdell,R.J. and Isaacs,N.M. (1997) J. Mol. Biol., 268, 412–423.

Rapoport,T.A., Jungnickel,B. and Kutay,U. (1996) Annu. Rev. Biochem., 65, 271–303.

Rees,D.C., DeAntonio,L. and Eisenberg,D. (1989) Science, 245, 510–513.

Sakaguchi,M., Tomiyoshi,R., Kuroiwa,T., Mihara,K. and Omura,T. (1992) Proc. Natl Acad. Sci. USA, 89, 16–19.

Steitz,T.A., Goldman,A. and Engelman,D.M. (1982) Biophys. J., 37, 124–125.

Terwilliger,T.C. and Eisenberg,D. (1982) J. Biol. Chem., 257, 6010–6015.

Tsukihara,T., Aoyama,H., Yamashita,E., Tomizaki,T., Yamaguchi,H., Shinzawa-Itoh,K., Nakashima,R., Yaono,R. and Yoshikawa,S. (1995) Science, 269, 1069–1074.

von Heijne,G. (1989) Nature, 341, 456–458.

Wallin,E. and von Heijne,G. (1998) Protein Sci., 7, 1029–1038.

Received February 26, 1999; revised June 2, 1999; accepted July 12, 1999.