Faculty of Biology, Department of Cell Biology and Biophysics, University of Athens, Panepistimiopolis, Athens 15701, Greece
Abstract |
---|
Keywords: hydrophobicity analysis/membrane proteins/prediction/protein structure/transmembrane regions
Introduction |
---|
A number of methods or algorithms designed to locate the transmembrane regions of membrane proteins have been developed (von Heijne, 1992; Persson and Argos, 1994; Cserzo et al., 1997). In several cases, better results are obtained when extra information from multiple alignments of homologous proteins is used (Rost et al., 1993; Persson and Argos, 1994). However, when no homologues can be found in the databases, it is important to improve prediction methods that use only the information contained in the protein sequence itself.
Prediction methods based on a hydrophobicity analysis can highlight most of the transmembrane regions of a protein (von Heijne, 1992). However, they fail to discriminate perfectly between segments corresponding to real transmembrane parts and simple, highly hydrophobic stretches of residues.
The algorithm presented in this paper refines information given by a hydrophobicity analysis, with the detection of favourable patterns that highlight potential termini (starts and ends) of transmembrane regions. Thus, highly hydrophobic stretches of residues that are not delimited by clear start and end configurations can be discarded. In contrast, favourable patterns can extract some transmembrane regions not clearly distinguishable by their hydrophobic composition.
Methods |
---|
The PRED-TMR method, presented in this work, is based on a statistical study of transmembrane proteins. Despite the lack of precision and fidelity of SwissProt (Cserzo et al., 1997), we chose to collect the information needed from the whole database instead of using a limited set that may not be statistically representative.
Our method was optimized on a subset of 64 reliable proteins previously used in several prediction programs (Jones et al., 1994; Rost et al., 1995; Aloy et al., 1997) that were available in the public databases (the sequences used and the results obtained are presented on our web site at http://o2.db.uoa.gr/PRED-TMR/Results/). We relied on transmembrane segment topologies indicated in SwissProt release 35 or, when unavailable, in the paper by Rost et al. (1996).
The reliability of the predictions was tested on several sets of sequences used to rate recently published algorithms. The PRED-TMR algorithm was also applied to the whole SwissProt database.
Information gathering
Some 9392 transmembrane proteins were automatically extracted from the SwissProt database, release 35, based on the presence in the feature table of the `TRANSMEM' keyword. The information relative to the transmembrane regions and their peripheral residues was stored in a database called DB-TMR. This database contains for each transmembrane segment:
This information can easily be filtered by organism or transmembrane type in order to refine the statistical analysis. The database and the description of the format used can be downloaded from our web site at http://o2.db.uoa.gr/DB-TMR/.
To minimize the impact of erroneous information, transmembrane segments that extend beyond the end(s) of the sequenced region or with unknown end-points are discarded before the statistical calculations.
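The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' extraction code: it assumes old-style SwissProt flat-file feature lines of the form `FT   TRANSMEM   from   to   description`, with `?` marking an unknown end-point and `<` or `>` marking a segment that extends beyond the sequenced region. The function name `parse_transmem` is hypothetical.

```python
def parse_transmem(ft_lines):
    """Return (start, end) pairs for TRANSMEM features with reliable end-points.

    Assumes whitespace-separated feature lines: FT TRANSMEM <from> <to> [text].
    """
    segments = []
    for line in ft_lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] != "FT" or fields[1] != "TRANSMEM":
            continue
        start, end = fields[2], fields[3]
        # Unknown ('?') or open-ended ('<1', '>350') end-points are not plain
        # integers, so such segments are discarded before any statistics.
        if not (start.isdigit() and end.isdigit()):
            continue
        segments.append((int(start), int(end)))
    return segments


example = [
    "FT   TRANSMEM     12     32       POTENTIAL.",
    "FT   TRANSMEM     <1     20       POTENTIAL.",
    "FT   TRANSMEM     50      ?       POTENTIAL.",
    "FT   DOMAIN       33     49       CYTOPLASMIC.",
]
reliable = parse_transmem(example)
```

Only the first example line survives the filter; the other TRANSMEM lines have unreliable end-points and the DOMAIN line is not a transmembrane feature.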
Distribution of transmembrane segment length
The 40 548 transmembrane segments with reliable end-points contained in DB-TMR have an average length of 21.30 residues and a standard deviation of 2.56 residues. The distribution is sharper than a Gaussian distribution, with 60% of the transmembrane segments having a length of 21 residues and 94% having a length between 17 and 25 residues. A simple approximation of the curve is given by the function
|
Calculation of amino acid residue transmembrane propensities (potentials)
A propensity for each residue to be in a transmembrane region was calculated using the equation
|
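The propensity equation itself is not reproduced in this copy of the text. As a rough sketch only, the code below assumes a Chou and Fasman style frequency ratio: the frequency of a residue type inside transmembrane segments divided by its frequency in the whole database. The function name `tm_propensities` and the toy input are illustrative assumptions, not the published definition or data.

```python
from collections import Counter


def tm_propensities(tm_segments, all_sequences):
    """Assumed propensity: (frequency in TM segments) / (frequency overall)."""
    tm_counts = Counter("".join(tm_segments))
    all_counts = Counter("".join(all_sequences))
    tm_total = sum(tm_counts.values())
    all_total = sum(all_counts.values())
    return {
        aa: (tm_counts[aa] / tm_total) / (all_counts[aa] / all_total)
        for aa in all_counts
    }


# Toy data: one sequence whose first four residues form the TM segment.
toy = tm_propensities(["LLLA"], ["LLLAKKKA"])
```

Under this assumption a value above 1 indicates a residue that is over-represented in transmembrane segments, and a value below 1 one that is under-represented.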
Evaluation of the `hydrophobicity' of a sequence of residues
Following a similar, but not identical, definition put forward by Sipos and von Heijne (1993), the table of transmembrane propensities was translated into a new, statistically based, `hydrophobicity' scale defined by
|
The `hydrophobicity' of a sequence of residues from position m to position p is evaluated by
|
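Neither equation survives in this copy, so the following is only a sketch under two stated assumptions: that the per-residue scale is the natural logarithm of the transmembrane propensity, and that the value for a stretch from m to p is the sum of the per-residue values. The names `hydrophobicity_scale` and `segment_hydrophobicity` are hypothetical.

```python
import math


def hydrophobicity_scale(propensities):
    """Assumed scale: natural log of each positive propensity."""
    return {aa: math.log(p) for aa, p in propensities.items() if p > 0}


def segment_hydrophobicity(sequence, scale, m, p):
    """Sum the per-residue values from position m to p (1-based, inclusive)."""
    return sum(scale[aa] for aa in sequence[m - 1:p])


# Toy propensities chosen so that the logs are easy to check by hand.
scale = hydrophobicity_scale({"L": math.e, "K": 1.0})
h_23 = segment_hydrophobicity("KLLK", scale, 2, 3)
```

With a logarithmic scale, residues favoured in transmembrane segments contribute positive values and disfavoured residues negative ones, so the sum behaves like a log-odds score for the stretch.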
Calculation of favourable terminal (end) configurations of transmembrane regions
Favourable configurations are computed for decapeptides centred at the border of transmembrane regions (five residues outside and five residues inside the membrane). Positions in the decapeptide are counted from 0 to 9. For the N-terminal end (side), hereafter also referred to as the `left end', position 0 corresponds to the residue five residues before the first amino acid residue of the transmembrane segment and position 9 corresponds to the residue four residues after this residue. For the C-terminal end (side), hereafter also referred to as the `right end', position 0 corresponds to the residue five residues after the last amino acid residue of the transmembrane segment and position 9 corresponds to the residue four residues before this residue (Figure 1).
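The position convention for the two decapeptides can be made concrete with a short sketch. The helper names are hypothetical; residue numbers are 1-based, as in the text, and the right-end decapeptide is read from outside the membrane inwards so that position 0 is always the outermost residue.

```python
def left_decapeptide(sequence, first):
    """Decapeptide around the N-terminal border; `first` is the first TM residue."""
    i = first - 1  # 0-based index of the first TM residue
    if i - 5 < 0 or i + 5 > len(sequence):
        return None  # not enough flanking residues
    # Position 0 is five residues before `first`, position 9 four after it.
    return sequence[i - 5:i + 5]


def right_decapeptide(sequence, last):
    """Decapeptide around the C-terminal border; `last` is the last TM residue."""
    j = last - 1  # 0-based index of the last TM residue
    if j - 4 < 0 or j + 5 >= len(sequence):
        return None
    # Position 0 is five residues after `last`, position 9 four before it,
    # so the slice is reversed.
    return sequence[j - 4:j + 6][::-1]


seq = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # toy sequence; letters stand in for residues
left = left_decapeptide(seq, 8)     # TM segment assumed to start at H
right = right_decapeptide(seq, 20)  # TM segment assumed to end at T
```

In both cases positions 0 to 4 lie outside the membrane and positions 5 to 9 inside it, which is what allows the same table layout to be used for both ends.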
|
|
For the N-terminal (`left') side of a transmembrane segment, the propensity Ppleft of an amino acid residue, at position p in the sequence, to be the first one in the lipid-associated structure (the first residue of the transmembrane domain) is defined by the equation
|
Similarly, for the C-terminal side (`right') of a transmembrane segment, the propensity for an amino acid at position p to be the first residue outside the transmembrane region is defined by
|
However, using only Pleft propensities to find good `left' configurations (or Pright to find `right' configurations) is not sufficient. Some decapeptides can indeed generate high scores for both `left' and `right' propensities. We have, for example, to discard decapeptides such as `ILFVSTFFTM' which give a good value for Pleft of 1.75 and a high value for Pright of 2.61.
By looking at the Pleft and Pright values for known transmembrane segments, we found that the scores themselves are less important than the difference between `left' and `right' values.
We combined both propensities to obtain start and end indicators of transmembrane segments using the equations
|
Scoring of transmembrane regions
A well defined transmembrane region should give good scores for all three parameters (LeftInd, RightInd and H). However, when applied to known transmembrane segments, a large proportion scored small values for one or two of these indicators. In most cases, weak indicators are compensated by excellent values obtained for the remaining one(s).
High values can also be obtained for very short or very long segments. These segments of improbable length should be discarded unless the configuration is very clear (when high values are obtained for all three indicators).
We introduce in the scoring formula a negative indicator, which performs a filtering of the probable transmembrane segments depending on their length. This is calculated with
|
Each of the four indicators should contribute with the same weight in the evaluation of the score for a segment. After normalization of the hydrophobicity parameter, the score of a sequence from m to p is calculated by
|
Prediction algorithm
For each position m in the sequence, the maximum score that can be obtained if this position corresponds to the beginning of a transmembrane region is calculated as
|
For each position, the MScorem obtained and the corresponding end position are memorized. In the table generated, the highest MScorem is selected and the corresponding region is marked as transmembrane. Then, the second highest MScorem that does not overlap with a previously marked region is selected and this process is continued with the next MScorem, until all possible regions are found.
As an example, consider the table of MScorem obtained for the segment from residue 276 to residue 325 of 5HT3_MOUSE (Table I). In this table, the program selects the highest MScorem (89 at position 307) and marks the segment from 307 to 324 as transmembrane. Then, it selects the next highest admissible MScorem; 80 at position 310 cannot be selected because this position is part of the first selected transmembrane domain. Also, 69 at position 303 cannot be selected because it represents a segment that ends at position 321, inside the transmembrane domain. The next possible MScorem is 34, at position 282, which represents a transmembrane segment from residue 282 to residue 303. As it is not possible to select a third segment, the program ends. For this region of the protein, with observed (putative) transmembrane segments at 278–296 and 306–324, the algorithm detects two transmembrane domains at 282–303 and 307–324.
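The selection step can be sketched as a greedy pass over the candidates in order of decreasing score. This is an illustration rather than the authors' implementation: the function name `select_segments` is hypothetical, any score cut-off is left out, and the end position given for the candidate starting at 310 is invented because the text does not state it.

```python
def select_segments(candidates):
    """Greedily keep the best-scoring segments that do not overlap.

    `candidates` holds (start, end, score) tuples, one per start position.
    Returns the chosen (start, end) pairs in sequence order.
    """
    chosen = []
    for start, end, _score in sorted(candidates, key=lambda c: c[2], reverse=True):
        # Keep the candidate only if it lies entirely before or entirely
        # after every region already marked as transmembrane.
        if all(end < s or start > e for s, e in chosen):
            chosen.append((start, end))
    return sorted(chosen)


candidates = [
    (307, 324, 89),
    (310, 327, 80),  # end position 327 is a made-up placeholder
    (303, 321, 69),
    (282, 303, 34),
]
predicted = select_segments(candidates)
```

Run on these four candidates, the sketch keeps 307–324 first, rejects the candidates at 310 and 303 because they overlap it, and then accepts 282–303, matching the walk-through above.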
|
Results |
---|
We optimized the hydrophobicity indicator cut-off on a subset of 64 proteins of the set used by Rost et al. (1995) (the sequences 2MLT, GLRA_RAT, GPLB_HUMAN, IGGB_STRSP and PT2M_ECOLI, which were not found in the public databases, were not used). The best results were obtained when segments with NHmp < 2 were discarded. On the set of 64 proteins, an agreement factor of 88.24% was obtained, with a correlation coefficient of 0.79 and a ratio of segment matches of 0.945.
In order to test the PRED-TMR algorithm, we collected all available sequences used in three recent papers (Rost et al., 1995, 1996; Cserzo et al., 1997) and discarded those with more than 25% homology. The resulting set contains 101 non-homologous transmembrane proteins in total. Details of the results obtained are not shown here, but they can be downloaded together with the list of the transmembrane segment assignments from http://o2.db.uoa.gr/PRED-TMR/Results/.
The results of the test on this set of 101 proteins gave an average Q of 88.83%, a C of 0.80 and a ratio of segment matches, SM, of 0.954. One protein (1%) has a correlation coefficient <0.4 and 10 have C < 0.6 (10%). These scores are similar to those obtained by excluding the proteins used for the optimization of the hydrophobicity indicator cut-off (Q = 87.81%, C = 0.78 and SM = 0.943).
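The definitions of Q, C and SM are not spelled out in this copy of the text. As a hedged sketch, the code below assumes that Q is the percentage of residues assigned to the correct state (transmembrane or not) and that C is the per-residue correlation coefficient of Matthews (1975), which appears in the reference list. SM is not sketched because its matching criterion is not given here. The function name `residue_scores` is hypothetical.

```python
import math


def residue_scores(observed, predicted, length):
    """Return (Q in percent, Matthews C) for one protein.

    `observed` and `predicted` are lists of 1-based inclusive (start, end)
    segments; `length` is the number of residues in the protein.
    """
    obs = {r for s, e in observed for r in range(s, e + 1)}
    pred = {r for s, e in predicted for r in range(s, e + 1)}
    tp = len(obs & pred)           # transmembrane in both
    fp = len(pred - obs)           # predicted only
    fn = len(obs - pred)           # observed only
    tn = length - tp - fp - fn     # non-transmembrane in both
    q = 100.0 * (tp + tn) / length
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    c = (tp * tn - fp * fn) / denom if denom else 0.0
    return q, c


perfect = residue_scores([(11, 30)], [(11, 30)], 100)
shifted = residue_scores([(11, 30)], [(21, 40)], 100)
```

In the toy example, a prediction shifted by ten residues still labels 80% of residues correctly but has a much lower correlation coefficient, which illustrates why the two measures are reported side by side.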
Table II shows the results produced applying PRED-TMR and five other prediction methods on the set of 101 proteins. Looking at the correlation coefficient, PRED-TMR was found to perform slightly better than the two best methods, PHDhtm and tmPRED, on this set. Concerning the agreement factor, PRED-TMR performs in a similar way to tmPRED and TOPPRED, whereas for the ratio of segment matches it is slightly worse than PHDhtm, which is best.
|
SwissProt, release 35, contains 9392 transmembrane sequences with a total of 40 672 transmembrane regions. For this test, we did not discard the transmembrane segments with uncertain end-points, as we did when establishing the statistics. The PRED-TMR algorithm applied to all proteins contained in the SwissProt database produces slightly lower values for the Q and C scores and a larger decrease in the ratio of segment matches (Q = 86.14%, C = 0.73, SM = 0.889) relative to the test set of 101 proteins mentioned above. Of the 9392 proteins, 1710 (18%) have C < 0.6.
Discussion |
---|
Since PRED-TMR is a very fast algorithm and requires only the information contained in the protein sequence itself, its most promising use is expected to be the analysis of ORFs (open reading frames) predicted by the various genome projects, especially those ORFs that correspond to proteins of unknown function. Aided by a pre-processing stage that could identify whether the sequence under study belongs to a membrane protein, it will be useful in the recognition of transmembrane domains. Such a pre-processing stage is well under way in our laboratory (C.Pasquier and J.S.Hamodrakas, in preparation). It is a neural network-based system which classifies proteins into four classes: fibrous (structural), globular, mixed (fibrous and globular) and membrane. The PRED-TMR algorithm has already been applied to the ORFs predicted from two genome projects and these results are currently being studied in detail.
PRED-TMR can certainly be improved by carefully selecting a representative and reliable set of transmembrane proteins to build the different tables. Ambiguities and errors in the existing databases impose limitations on its accuracy. When the statistical parameters used in the scoring formula were derived from the set of 64 proteins used to optimize the hydrophobicity cut-off, instead of from the entire SwissProt database, the accuracy scores decreased when the PRED-TMR algorithm was applied to sets larger than the original 64 proteins. This is certainly due to the small reference set and reflects some special features of its sequences. However, it is believed that the most promising way to improve the accuracy of prediction is to alter the scoring formula. Indeed, it was found that the length penalty used is not the most appropriate because it penalizes too harshly segments with a length outside the [17–25] range. Several other parameters can be added to the scoring formula, such as the positive inside rule defined by von Heijne (1992). However, we are convinced that this kind of algorithm will always be limited by the problem of applying a strict cut-off to the hydrophobicity indicator. Fuzzy logic seems to be a good technique to overcome this limitation by introducing some haziness in decision making.
A WWW server running the PRED-TMR algorithm is available at http://o2.db.uoa.gr/PRED-TMR/.
References |
---|
Chou,P.Y. and Fasman,G.D. (1978) Adv. Enzymol., 47, 45–148.
Cserzo,K., Wallin,E., Simon,I., von Heijne,G. and Elofsson,A. (1997) Protein Engng, 10, 673–676.
Fisher,R.A. (1958). Statistical Methods for Research Workers. 13th edn. Hafner, New York, p. 183.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1994) Biochemistry, 33, 3038–3049.
Matthews,B.W. (1975) Biochim. Biophys. Acta, 405, 442–451.
Persson,B. and Argos,P. (1994) J. Mol. Biol., 237, 182–192.
Rost,B. and Sander,C. (1999). In Webster D.M. (ed.), Predicting Protein Structure. Humana Press, Clifton, NJ, in press. http://www.embl-heidelberg.de/rost/Papers/98revSecStr.html.
Rost,B., Casadio,R., Fariselli,P. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.
Rost,B., Casadio,R., Fariselli,P. and Sander,C. (1995) Protein Sci., 4, 521–533.
Rost,B., Fariselli,P. and Casadio,R. (1996) Protein Sci., 5, 1704–1718.
Sipos,L. and von Heijne,G. (1993) Eur. J. Biochem., 213, 1333–1340.
von Heijne,G. (1992) J. Mol. Biol., 225, 487–494.
Received September 29, 1998; revised January 22, 1999; accepted January 26, 1999.