¹Department of Electronic and Information System Engineering, Faculty of Science and Technology, Hirosaki University, Hirosaki 036-8561, Japan
²Present address: Graduate School of Humanity and Science, Ochanomizu University, 2-1-1 Otsuka, Bunkyo-ku, Tokyo 112-8610, Japan
³To whom correspondence should be addressed. e-mail: slsimi{at}si.hirosaki-u.ac.jp
Abstract |
---|
Keywords: binary topology pattern/functional identification/loop length/transmembrane protein/transmembrane topology
Introduction |
---|
At the same time, this rather simple structural feature makes the prediction of the secondary structure (the TM topology, i.e. the number of tms, the loop lengths and the N-tail location) from the amino acid sequence an easier task for TM proteins than for soluble proteins. In this context, many TM topology prediction methods have been proposed so far (e.g. Claros and von Heijne, 1994; Jones et al., 1994; Rost et al., 1996; Hirokawa et al., 1998; Sonnhammer et al., 1998; Tusnady and Simon, 1998), although their prediction accuracy is not yet high enough (Moeller et al., 2001; Chen et al., 2002; Ikeda et al., 2002). To obtain predictions of higher accuracy in practice, several consensus approaches that combine several of the proposed prediction methods have recently been tried (Promponas et al., 1999; Nilsson et al., 2000, 2002; Bertaccini and Trudell, 2002; Ikeda et al., 2002, 2003; Kall and Sonnhammer, 2002).
One of the reasons why so much effort has been put into developing TM topology prediction methods is that there is a good possibility of classifying and identifying the functions of TM protein sequences once their accurate TM topologies are known. For example, Tusnady et al. (Tusnady et al., 1997) suggested that 12-tms ABC transporter proteins are characterized by a specific and common TM topology pattern and that TM topology pattern analysis may significantly help the search for characteristic domains, in addition to sequence comparisons. From their analysis of four-tms receptors and channel proteins, Clements and Martin (Clements and Martin, 2002) recently proposed a new idea for the functional identification of TM proteins by searching for characteristic patterns in the hydropathy profiles. It has also been reported that the intracellular second and fourth loops of G-protein coupled receptors (GPCRs) are short and strongly conserved in length, while the intracellular sixth loop, which is quite long, varies widely in length (Otaki and Firestein, 2001). The authors indicated the possibility of classifying the GPCR functions according to the loop lengths. From these findings, it seems that, in the evolutionary process, the TM topology has been conserved more rigorously than the amino acid sequence in order to preserve the function of the TM protein.
In this study, we propose a novel method for classifying/identifying TM protein functions based on the TM topology, i.e. the length characteristics of the loops. In this method, the length of each loop is expressed as 1 or 0, depending on whether it is longer or shorter, respectively, than the threshold length defined for that loop. The TM topology is then treated as a string of 0s and 1s, which is named the binary topology pattern (BTP).
Materials and methods |
---|
The data used in this study are TM protein sequences taken from SwissProt 38.0 (Bairoch and Apweiler, 2000). Excluding the TM protein entries with a partly defined sequence (i.e. a fragment) and those with an unknown N-terminus location, we finally obtained 4348 sequences with numbers of tms from one to 30. We focused on 2097 entries with 2–12 tms, with the exception of nine tms, which were classified into 37 functional groups, including 10 'others' groups, according to the functional descriptions in the DE, CC or KW lines of the SwissProt database, as summarized in Table I. TM proteins annotated with a probable, putative or hypothetical function were included only in the 'others' group.
It should be noted here that 26 entries included in the three-tms glutamate receptor group (which contains 35 entries in total, see Table I) are originally registered in SwissProt 38.0 as four-tms glutamate receptors. Following the reports that the second tms in previously proposed topology models does not span the membrane but is a membrane pore-lining loop (Hollmann et al., 1994; Anand, 2000), we decided to treat the 26 entries as three-tms glutamate receptors in this study, without changing the annotated N-tail location or the segment positions of the remaining three segments.
The list of classified TM proteins used in this study is available at ftp://bioinfo.si.hirosaki-u.ac.jp/BTP/.
Binary topology pattern (BTP)
Consider an amino acid sequence belonging to a certain functional group of TM proteins with n tms. Let $l_i$ denote the length of the ith loop ($1 \le i \le n + 1$). Here, $l_1$ is the length of the N-tail loop. Next, we define the threshold length of the ith loop, $l_{ti}$, which is compared with $l_i$ in order to assign a binary loop length, $b_i$, to the ith loop by using the following criteria:

$$ b_i = \begin{cases} 1, & l_i \ge l_{ti} \\ 0, & l_i < l_{ti} \end{cases} $$

Here, 1 means that the loop is a long one, and 0 a short one. For example, for the case of a four-tms gap junction [gap junction protein CX32.2, SwissProt ID CX32_MICUN (Yoshizaki et al., 1994)] with the loop lengths $l$ = {18, 36, 55, 20, 71} residues, the binary loop lengths are determined as $b$ = (0, 1, 1, 0, 1) with $l_t$ = {47, 30, 28, 80, 42} residues, as illustrated in Figure 1.
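The assignment of binary loop lengths can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `binary_loop_lengths` is hypothetical, and treating a loop whose length equals the threshold as long is an assumption inferred from the way short loops are reported later in the text (e.g. "<41 residues").

```python
def binary_loop_lengths(loop_lengths, thresholds):
    """Return one binary loop length per loop: 1 (long) or 0 (short).

    Assumption: a loop exactly as long as its threshold counts as long.
    """
    if len(loop_lengths) != len(thresholds):
        raise ValueError("one threshold is needed per loop")
    return [1 if l >= lt else 0 for l, lt in zip(loop_lengths, thresholds)]


# Worked example from the text: four-tms gap junction CX32_MICUN.
l = [18, 36, 55, 20, 71]    # loop lengths (N-tail, 1-2, 2-3, 3-4, C-tail)
lt = [47, 30, 28, 80, 42]   # threshold lengths for the four-tms ensemble
print(binary_loop_lengths(l, lt))  # [0, 1, 1, 0, 1]
```

The output reproduces the binary loop lengths b = (0, 1, 1, 0, 1) quoted for this entry.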
The average binary loop length of the ith loop, $a_i$, of a functional group is obtained by averaging the binary loop lengths over the entries of the group:

$$ a_i = \frac{1}{N} \sum_{k=1}^{N} b_i^{(k)} $$

where $b_i^{(k)}$ is the binary loop length of the ith loop of the kth entry and N is the number of entries contained in the functional group. The lengths of the individual loops vary from sequence to sequence even within a single functional group, although the degree of variation differs from loop to loop, as seen in Figure 2. The average binary loop lengths of the first loop (N-tail loop) and the second loop (1–2 loop) change rapidly from 1.0 to 0.0 with an increase of $l_{ti}$ in narrow ranges near 20 and 35 residues, respectively, indicating that the loop lengths are quite close to each other. In contrast, the lengths of the third loop (2–3 loop), fourth loop (3–4 loop) and fifth loop (C-tail) are much more divergent, the fifth loop in particular.
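As a rough sketch of how the average binary loop length of a group could be computed, the snippet below takes the fraction of entries whose ith loop is long at a given threshold. The function name and the four-entry toy group are hypothetical (only the first entry uses the loop lengths quoted in the text), and the "length equal to threshold counts as long" convention is an assumption.

```python
def average_binary_loop_lengths(group_loop_lengths, thresholds):
    """Fraction of entries in a group whose ith loop is long (l_i >= lt_i)."""
    n_entries = len(group_loop_lengths)
    if n_entries == 0:
        raise ValueError("the functional group is empty")
    counts = [0] * len(thresholds)
    for lengths in group_loop_lengths:
        if len(lengths) != len(thresholds):
            raise ValueError("every entry needs one length per loop")
        for i, (l, lt) in enumerate(zip(lengths, thresholds)):
            if l >= lt:
                counts[i] += 1
    return [c / n_entries for c in counts]


# Hypothetical toy group; the first row is the CX32_MICUN example.
toy_group = [
    [18, 36, 55, 20, 71],
    [20, 35, 60, 25, 40],
    [19, 34, 52, 18, 90],
    [17, 37, 58, 22, 45],
]
print(average_binary_loop_lengths(toy_group, [47, 30, 28, 80, 42]))
# [0.0, 1.0, 1.0, 0.0, 0.75]
```

A value of exactly 0.0 or 1.0 means every entry agrees on that loop, while an intermediate value such as 0.75 signals a loop whose length straddles the threshold within the group.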
The threshold length of each loop is chosen so that the functional groups with the same n-tms are separated as well as possible. For this purpose, we use the r.m.s. difference, $d_i$, of the average binary loop lengths between the groups:

$$ d_i = \sqrt{\frac{2}{m(m-1)} \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \left( a_{pi} - a_{qi} \right)^2} $$

where m is the number of functional groups with n-tms, and $a_{pi}$ and $a_{qi}$ are the ith average binary loop lengths of functional groups p and q, respectively. In Figure 3, the r.m.s. difference, $d_i$, is plotted against the threshold length, $l_{ti}$, for the individual loops. For each loop, the threshold length giving the maximum r.m.s. difference is taken as the optimum threshold length, at which the average binary loop lengths of the respective groups are most clearly discriminated from each other. In this example, the threshold lengths were obtained as 44–50, 29–31, 27–29, 80 and 42 residues for the first, second, third, fourth and fifth loops, respectively. For the first, second and third loops, whose threshold lengths were not determined uniquely, we adopted the average of each range as the optimum threshold length. This seems a proper treatment, since $a_i$ does not change when the threshold length is varied within the ranges obtained for these three loops (44–50, 29–31 and 27–29 residues). This is not the case for the fourth and fifth loops, whose threshold lengths are determined uniquely: even a small deviation (a single residue) from the obtained threshold lengths (80 and 42 residues, respectively) alters the average binary loop lengths. When we take 39 or 43 residues (instead of 42) as the threshold length for the fifth loop, for example, $a_5$ becomes 0.99 or 0.92 (instead of 0.94 for 42 residues). Thus, 47, 30, 28, 80 and 42 are obtained as the optimum threshold lengths for the individual loops in the ensemble of the four-tms functional groups, and $a$ for the gap junction group, for example, is calculated as (0.00, 1.00, 1.00, 0.00, 0.94) with $l_t$ = {47, 30, 28, 80, 42}.
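The threshold search described above can be sketched as a simple scan over candidate thresholds for one loop. This is an illustrative sketch under stated assumptions: the r.m.s. difference is taken over all pairs of groups (the exact normalization is an assumption, though it does not affect which threshold is optimal), the names `scan_thresholds` and `mean_squared_difference` are hypothetical, and the three toy groups are invented. Exact fractions are used so that ties between thresholds are detected reliably.

```python
from fractions import Fraction
from itertools import combinations
from math import sqrt


def mean_squared_difference(averages):
    """Mean of (a_p - a_q)^2 over all pairs of groups (needs >= 2 groups)."""
    pairs = list(combinations(averages, 2))
    return sum((p - q) ** 2 for p, q in pairs) / len(pairs)


def scan_thresholds(lengths_by_group, candidates):
    """Return (max r.m.s. difference, thresholds attaining it) for one loop.

    lengths_by_group maps a group name to the lengths of this loop
    in every entry of that group.
    """
    best, best_thresholds = None, []
    for lt in candidates:
        averages = [
            Fraction(sum(l >= lt for l in lengths), len(lengths))
            for lengths in lengths_by_group.values()
        ]
        msd = mean_squared_difference(averages)
        if best is None or msd > best:
            best, best_thresholds = msd, [lt]
        elif msd == best:
            best_thresholds.append(lt)
    return sqrt(best), best_thresholds


# Hypothetical lengths of one loop in three toy groups.
toy = {"A": [18, 20, 19], "B": [50, 55, 60], "C": [10, 52, 58]}
d_max, optimum = scan_thresholds(toy, range(1, 71))
print(optimum[0], optimum[-1])      # 21 50
print(sum(optimum) / len(optimum))  # 35.5
```

In this toy case the maximum is attained over a whole range of thresholds (21 to 50 residues), so, following the treatment in the text, the average of the range would be adopted as the optimum threshold length.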
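From the average binary loop lengths, the BTP of a functional group, $p = (p_1, \ldots, p_{n+1})$, is determined with a permission value, $\delta$:

$$ p_i = \begin{cases} 1, & a_i \ge 1 - \delta \\ 0, & a_i \le \delta \\ *, & \text{otherwise} \end{cases} $$

where * is the wild card, meaning that the binary loop length is not defined for the ith loop. When we set the value of $\delta$ to 0.01, for example, the BTP, $p$, for the gap junction group becomes 0110*.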
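The derivation of a BTP from the average binary loop lengths can be illustrated as follows. This is a sketch under assumptions: the permission value is written `delta` here, a position is taken as 1 when its average is within `delta` of 1.0, as 0 when it is within `delta` of 0.0, and as the wild card otherwise; whether the boundaries are inclusive is an assumption, and the function name is hypothetical.

```python
def binary_topology_pattern(averages, delta):
    """Build a BTP string of '0', '1' and '*' from average binary loop lengths."""
    if not 0 <= delta < 0.5:
        raise ValueError("delta must lie in [0, 0.5)")
    symbols = []
    for a in averages:
        if a >= 1 - delta:
            symbols.append("1")   # (almost) all entries have a long loop
        elif a <= delta:
            symbols.append("0")   # (almost) all entries have a short loop
        else:
            symbols.append("*")   # loop length not defined for this group
    return "".join(symbols)


# Gap junction group from the text, with lt = {47, 30, 28, 80, 42}.
a_gap_junction = (0.00, 1.00, 1.00, 0.00, 0.94)
print(binary_topology_pattern(a_gap_junction, 0.01))  # 0110*
```

With a larger permission value, the fifth position would eventually become defined as 1, which makes the pattern more specific but less tolerant of variation within the group.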
An appropriate value of $\delta$ should be assigned to each functional group so that the obtained BTP identifies its relevant function with maximal self-consistency. The self-consistency of the functional identification by the BTP, $S_c$, is defined as the geometric mean of the sensitivity, $S_n$, and the specificity, $S_p$:

$$ S_c = \sqrt{S_n S_p} $$

Here, the sensitivity is the ratio of the correctly identified entries to the total entries in the group, and the specificity is the ratio of the correctly identified entries to the total entries predicted by the pattern across the functional groups with the same n-tms.
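A minimal sketch of how a pattern could be applied to a set of entries and scored is given below. The function names and the five toy entries are hypothetical. Note that the specificity as defined in the text (correct hits divided by all hits) corresponds to what is often called precision.

```python
from math import sqrt


def matches(pattern, binary_lengths):
    """True if every defined position of the BTP equals the entry's digit."""
    return len(pattern) == len(binary_lengths) and all(
        p == "*" or int(p) == b for p, b in zip(pattern, binary_lengths)
    )


def self_consistency(pattern, target_group, entries):
    """Return (Sn, Sp, Sc) for a pattern over entries with the same n-tms.

    entries is a list of (group_name, binary_loop_lengths) pairs.
    """
    hits = [group for group, b in entries if matches(pattern, b)]
    correct = sum(group == target_group for group in hits)
    in_group = sum(group == target_group for group, _ in entries)
    sn = correct / in_group if in_group else 0.0
    sp = correct / len(hits) if hits else 0.0
    return sn, sp, sqrt(sn * sp)


# Hypothetical four-tms toy data set.
entries = [
    ("gap junction", [0, 1, 1, 0, 1]),
    ("gap junction", [0, 1, 1, 0, 0]),
    ("receptor", [1, 0, 0, 1, 0]),
    ("others", [0, 1, 1, 0, 1]),   # matched in error by the pattern below
    ("others", [0, 0, 0, 1, 1]),
]
sn, sp, sc = self_consistency("0110*", "gap junction", entries)
print(round(sn, 3), round(sp, 3), round(sc, 3))  # 1.0 0.667 0.816
```

Here both gap junction entries are found (sensitivity 1.0), but one of the three hits belongs to another group, which lowers the specificity and hence the self-consistency.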
Figure 4 shows how $S_c$ varies with $\delta$ for the case of the four-tms ensemble. The self-consistencies increase at first in a range of small $\delta$, and then decrease with increasing $\delta$, except for the receptor group. It is reasonable to employ the smallest value of $\delta$ that gives the maximum $S_c$ as the appropriate one for each functional group. Thus, the values of $\delta$ determined for the receptor, gap junction and 'others' groups are 0.04, 0.01 and 0.16, respectively, which give the maximum values of $S_c$ to their corresponding BTPs: 10010, 0110* and 0*0**, respectively. We note that all the patterns thus obtained are exclusive of one another: the binary digits differ in four positions (all except the last) between receptor and gap junction, in the first position between receptor and 'others', and in the third position between gap junction and 'others'. Because the BTPs are exclusive of each other with these $l_t$ and $\delta$ values, the individual patterns can discriminate their corresponding functional groups from each other. This means that the appropriate BTPs are determined successfully with these parameter values.
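The exclusivity of two patterns can be checked mechanically. In the sketch below, two BTPs are taken to be exclusive when at least one position carries a defined digit in both patterns and the digits differ; this reading of "exclusive" and the function name are assumptions made for illustration.

```python
def discrepant_positions(p, q):
    """1-based positions where both BTPs are defined and their digits differ."""
    if len(p) != len(q):
        raise ValueError("patterns must describe the same number of loops")
    return [
        i + 1
        for i, (x, y) in enumerate(zip(p, q))
        if x != "*" and y != "*" and x != y
    ]


# Four-tms patterns quoted in the text.
receptor, gap_junction, others = "10010", "0110*", "0*0**"
print(discrepant_positions(receptor, gap_junction))  # [1, 2, 3, 4]
print(discrepant_positions(receptor, others))        # [1]
print(discrepant_positions(gap_junction, others))    # [3]
```

A non-empty result means that no entry can match both patterns at once, which agrees with the positions listed for the three four-tms groups.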
Results and discussion |
---|
Two-tms TM proteins
As seen in Table II, the five functional groups, including 'others', are discriminated from each other with high accuracies (0.938, 0.929, 0.779, 0.790 and 0.701 for potassium channel, sodium channel, receptor, sensor protein and 'others', respectively) by using the obtained BTPs. The obtained BTPs, even for the 'others' group, are exclusive of one another in at least one digit position. The first position distinguishes the two channel groups from receptor and sensor protein, indicating that the first loops are long (i.e. ≥41 residues) for both channels and short (<41 residues) for receptor and sensor protein. For the channels, the second and third loops characterize the two types in a complementary way: short (<151 residues, 0) and long (≥209 residues, 1) for the potassium channel, and long (≥151, 1) and short (<209, 0) for the sodium channel. Similarly, the second loop distinguishes receptor from sensor protein: long (≥151 residues, 1) for the former and short (<151, 0) for the latter.
Three-tms TM proteins
The BTPs obtained for the four functional groups other than 'others' are exclusive of each other and identify their respective sequences with quite high self-consistencies (glycoprotein, 0.966; glutamate receptor, 1.000; fumarate reductase, 0.957; kinase, 0.890), as shown in Table III. The first position is an exclusive digit, with 1 for glycoprotein and glutamate receptor, and 0 for fumarate reductase, kinase and 'others'. Thus, we could successfully perform functional identification of three-tms TM proteins using the obtained patterns, except for the 'others' group. The BTP obtained for glutamate receptor gives perfect identification accuracy, with all the loops being long, in particular the third loop, which distinguishes it from the other groups with a short third loop. As in the case of two-tms TM proteins, we do not need to use the 'others' pattern to identify 'others' protein sequences in this case.
Four-tms TM proteins
As shown in Table IV, the BTP of receptor identifies only the sequences of the receptor group, with high sensitivity, 0.954, and specificity, 1.000 (self-consistency, 0.977). It should be noted that 146 entries identified by the receptor pattern belong to the ligand-gated ionic channels family with the N-out location, while the seven other entries do not. For the gap junction group, the obtained pattern identifies all the gap junction entries correctly, and only one 'others' entry in error. Furthermore, the identification accuracy of 'others' is also still high enough, 0.904 (0.902), in contrast with the low accuracy in the cases of two-tms and three-tms TM proteins. The obtained patterns are exclusive of each other with these $l_t$ and $\delta$ values, so that the individual patterns can discriminate their corresponding functional groups from each other.
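As a quick arithmetic check of the receptor figures quoted above, taking the self-consistency as the geometric mean of the sensitivity and the specificity gives:

$$ S_c = \sqrt{S_n S_p} = \sqrt{0.954 \times 1.000} \approx 0.977 $$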
Five-tms TM proteins
In the five-tms transporter data set, various kinds of transporters are included, such as triose phosphate/phosphate translocator, cytochrome o ubiquinol oxidase subunit III, histidine transport system permease protein, etc., and there is a wide variety in the length of each loop, except for the fifth and sixth loops. This is reflected in the obtained pattern for the transporter group, in that only these two positions have a defined binary loop length and the others do not. Nevertheless, we can classify the five-tms protein sequences into two groups, transporter and 'others', with high enough accuracies, 0.964 and 0.966, respectively, as shown in Table V.
Six-tms TM proteins
The BTPs obtained with $l_t$ = {100, 15, 24, 11, 14, 38, 72} for the three six-tms functional groups show that channel, MIP channel and transporter are exclusive of each other, and their self-consistencies are 0.894, 0.934 and 0.849, respectively (Table VI). The MIP channel and transporter patterns each identify only one 'others' entry, even though neither pattern is explicitly exclusive of the 'others' pattern. Comparing the MIP channel and channel patterns shows that not only the long N-tail but also the long 4–5 and C-tail loops distinguish channel from MIP channel. Since the performance of the 'others' pattern is not high enough, it is not necessary to actually use this pattern in the six-tms case either.
Seven-tms TM proteins
All the BTPs obtained are exclusive of one another, except for the cases between GPCR class A and 'others', and between rhodopsin pump and 'others' (Table VII). Except for the 'others' pattern, the accuracies of the obtained patterns are quite high, in particular for class C, class E and rhodopsin pump, whose patterns identify their own entries perfectly without identifying any entries of other groups. Here, using 22 GPCR sequences which are registered in SwissProt 38.0 but were not used for determining the patterns, we tested the functional identification performance of the obtained patterns. Applying the class A pattern to these sequences, we identified 19 entries as GPCR class A, including Burkitt's lymphoma receptor, chemokine receptor-like protein, olfactory receptor-like protein, etc. Out of these 19 sequences, we confirmed that 13 belong to GPCR class A. The class B pattern identified two sequences, which are glucagon-like peptide 1 receptor precursors of GPCR class B.
Eight-tms TM proteins
Similar to the five-tms case, the eight-tms transporter group is a mixture of various kinds of transporters, such as calcium-transporting ATPase, potassium-transporting ATPase, renal sodium-dependent phosphate transporting protein, etc. As a result, the obtained BTPs are not exclusive of each other and are rather ambiguous. The discrimination ability of the transporter pattern, 0.874, is nevertheless still high enough, as shown in Table VIII, since 52 transporters out of 68 sequences are picked up by this pattern.
10-tms TM proteins
As shown in Table IX, the BTPs for the ATPase, transporter, exchanger and 'others' groups have high self-consistencies, 1.000, 0.949, 0.966 and 1.000, respectively. This result means that we can accurately classify 10-tms TM proteins into at least four functional groups. Even from the patterns in Table IX alone, we can see that each group has its own characteristic loop lengths. For example, almost all the odd-numbered loops of ATPase are long, except for the last one. In particular, the 4–5 loop is longer than 199 residues, and such a long loop is not found in the other 10-tms TM proteins. The transporter has short N-tail and 2–3 loops, and these characteristics are exclusive of ATPase. For exchanger, the pattern was determined at all positions except the 6–7 and 8–9 loops, in spite of the small permission value.
11-tms TM proteins
By using the obtained BTPs, 11-tms TM protein sequences can be classified into two functional groups, exchanger and 'others', with perfect accuracies, as seen in Table X. The 11-tms exchanger TM proteins are characterized by an extremely long sixth (5–6) loop and quite short seventh (6–7) and eighth (7–8) loops.
12-tms TM proteins
The BTPs for sodium transporter, sugar transporter and ABC transporter have sensitivities of 0.984, 0.949 and 0.923 and specificities of 0.867, 0.838 and 1.000, respectively, as shown in Table XI. The three transporter patterns are exclusive of each other and identified only a few 'others' entries. We note that the sugar transporter and sodium transporter patterns identified a fair number of entries of the 'others' group in error (i.e. 13 and 8 entries, respectively). It seems that a number of transporter sequences are included in SwissProt without being given a functional annotation as a transporter.
Taken together, the obtained BTPs have high accuracies for consistently identifying the entries of individual functions: the sensitivity, specificity and self-consistency are 0.898, 0.897 and 0.893, respectively, averaged over the 37 functional groups including the 'others' groups, and 0.940, 0.934 and 0.935, respectively, over the 27 functional groups without the 'others' groups.
We did not use the N-tail location in this methodology, as some functional groups contain entries with different N-tail locations, although these are only a small fraction. Incorporating the N-tail location into the BTP, after improving the prediction performance for the N-tail location, may help to further improve the ability of BTPs in functional classification/identification.
As seen in Table I, some functional groups, i.e. the four transporter groups and the two-tms receptor group, comprise both eukaryotic and prokaryotic sequences. Nevertheless, the individual BTPs determined for these groups exhibit quite high identification accuracies, indicating that the TM topologies associated with the same function have been well conserved between prokaryotes and eukaryotes.
We did not deal with single-spanning TM proteins in this study. Since at most four BTPs are available for single-spanning TM proteins, this number is too small to classify all of them. This limitation can be overcome, however, by applying the method in a stepwise manner, where classification into a few unified groups is performed first, followed by subdivision into several lower-level subgroups within the individual upper-level groups. This stepwise approach can also be applied successfully to the functional classification of multi-spanning proteins that have a deep hierarchical class structure, such as GPCRs (Y.Inoue and T.Shimizu, manuscript in preparation).
Finally, we would like to point out that the TM topology pattern is useful not only for functional classification/identification, but also for picking out the loops that seem to account for the functional differences among the groups in the ensemble with the same n-tms, as already mentioned.
Acknowledgements |
---|
References |
---|
Arai,M., Ikeda,M. and Shimizu,T. (2002) Gene, 304, 77–86.
Bairoch,A. and Apweiler,R. (2000) Nucleic Acids Res., 28, 45–48.
Bertaccini,E. and Trudell,J.R. (2002) Protein Eng., 15, 443–453.
Chen,C.P., Kernytsky,A. and Rost,B. (2002) Protein Sci., 11, 2774–2791.
Claros,M.G. and von Heijne,G. (1994) Comput. Appl. Biosci., 10, 685–686.
Clements,J.D. and Martin,R.D. (2002) Eur. J. Biochem., 269, 2101–2107.
Hirokawa,T., Boon-Chieng,S. and Mitaku,S. (1998) Bioinformatics, 14, 378–379.
Hollmann,M., Maron,C. and Heinemann,S. (1994) Neuron, 13, 1331–1343.
Ikeda,M., Arai,M., Lao,D.M. and Shimizu,T. (2002) In Silico Biol., 2, 19–33.
Ikeda,M., Arai,M., Okuno,T. and Shimizu,T. (2003) Nucleic Acids Res., 31, 406–409.
Jones,D.T. (1998) FEBS Lett., 423, 281–285.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1994) Biochemistry, 33, 3038–3049.
Kall,L. and Sonnhammer,E.L.L. (2002) FEBS Lett., 532, 415–418.
Krogh,A., Larsson,B., von Heijne,G. and Sonnhammer,E.L.L. (2001) J. Mol. Biol., 305, 567–580.
Liu,J. and Rost,B. (2001) Protein Sci., 10, 1970–1979.
Mitaku,S., Ono,M., Hirokawa,T., Boon-Chieng,S. and Sonoyama,M. (1999) Biophys. Chem., 82, 165–171.
Moeller,S., Croning,M.D.R. and Apweiler,R. (2001) Bioinformatics, 17, 646–653.
Nilsson,J., Persson,B. and von Heijne,G. (2000) FEBS Lett., 486, 267–269.
Nilsson,J., Persson,B. and von Heijne,G. (2002) Protein Sci., 11, 2974–2980.
Otaki,J.M. and Firestein,S. (2001) J. Theor. Biol., 211, 77–100.
Promponas,V.J., Palaios,G.A., Pasquier,C.M., Hamodrakas,J.S. and Hamodrakas,S.J. (1999) In Silico Biol., 1, 159–162.
Rost,B., Casadio,R. and Fariselli,P. (1996) In States,D.T., Agarwal,P., Gaasterland,T., Hunter,L. and Smith,R.F. (eds), Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 192–200.
Serres,M.H., Gopal,S., Nahum,L.A., Liang,P., Gaasterland,T. and Riley,M. (2001) Genome Biol., 2, research0035.1–0035.7.
Sonnhammer,E.L., von Heijne,G. and Krogh,A. (1998) In Glasgow,J., Littlejohn,T., Major,F., Lathrop,R., Sankoff,D. and Sensen,C. (eds), Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 175–182.
Stevens,T.J. and Arkin,I.T. (2000) Proteins: Struct. Funct. Genet., 39, 417–420.
Tusnady,G.E. and Simon,I. (1998) J. Mol. Biol., 283, 489–506.
Tusnady,G.E., Bakos,E., Varadi,A. and Sarkadi,B. (1997) FEBS Lett., 402, 1–3.
Wallin,E. and von Heijne,G. (1998) Protein Sci., 7, 1029–1038.
Yoshizaki,G., Patino,P. and Thomas,P. (1994) Biol. Reprod., 51, 493–503.
Received December 28, 2002; revised May 31, 2003; accepted June 8, 2003.