Computer-Aided Drug Discovery, Pharmacia & Upjohn, Kalamazoo, MI 49007-4940, USA
Abstract |
---|
Keywords: amino acid composition/bioinformatics/covariant discriminant/organelles/subcellular compartments
Introduction |
---|
In a pioneering study, Nakashima and Nishikawa (1994) proposed an algorithm to discriminate between intracellular and extracellular proteins by amino acid composition and residue-pair frequencies. In their method, the training set consisted of 894 proteins, of which 649 were intracellular and 245 extracellular; the testing set consisted of 379 proteins, of which 225 were intracellular and 154 extracellular. Recently, Cedano et al. (1997) extended the discriminative classes from two to five, i.e. extracellular, integral membrane, anchored membrane, intracellular and nuclear. This represents remarkable progress in this area. Furthermore, in an attempt to improve the prediction quality of protein cellular location, they proposed an algorithm called ProtLock. The idea of predicting the cellular location of a protein from its amino acid composition alone, as done in ProtLock, was stimulated by the encouraging results of structural class prediction, where the only input is also the amino acid composition (see, e.g., P.Y.Chou, 1980, 1989; Nakashima et al., 1986; K.C.Chou, 1995; Chou and Zhang, 1995). Analyses of how the structural class and the subcellular location of a protein correlate with its amino acid composition were recently given by Bahar et al. (1997) and Andrade et al. (1998), respectively.
Approaching the problem in a different way, Nakai and Kanehisa (1992) and Claros et al. (1997) proposed to predict the cellular location of proteins based on their N-terminal sorting signals. Obviously, these algorithms rely strongly on the existence of leader sequences. However, as pointed out recently by Reinhardt and Hubbard (1998), "In large genome analysis projects genes are usually automatically assigned and these assignments are often unreliable for the 5'-regions." "This can lead to leader sequences being missing or only partially included, thereby causing problems for prediction algorithms depending on them." Therefore, a method based on the amino acid composition would be more useful in practical applications.
As stated by Cedano et al. (1997), the ProtLock algorithm is mainly based on the procedure reported by Chou and Zhang (1995) for predicting protein structural classes according to Mahalanobis distances. The least Mahalanobis distance algorithm (K.C.Chou, 1995; Chou and Zhang, 1995) is valid only when the training subsets are of the same or approximately the same size; otherwise poor predictions will result (Chou et al., 1998; Chou and Maggiora, 1998). For this reason, the training set for each class in the ProtLock algorithm was chosen to contain the same number of proteins. However, as shown later, when the cellular protein classification is conducted at a deeper level, proteins located in some organelles are found to be much more abundant in the SWISS-PROT databank than those in others. For example, the number of proteins described as being located in the nucleus is much greater than that in the lysosome, and the number of proteins in the cytoplasm is much greater than that in the Golgi apparatus. Besides, for a real cell the number of cellular locations is much greater than the five considered by Cedano et al. (1997). In view of this, can we develop an algorithm to predict effectively the locations of proteins in cells at a much more discriminative level? The current study was initiated in an attempt to solve this problem.
Location classification |
---|
For the convenience of further study or practical application, the names of the 2319 proteins in S12 are listed in Appendix A, from which the datasets S7 and S5 can also be easily obtained. In this study, the datasets S12, S7 and S5 were used as the training datasets to predict the subcellular location of a protein among the 12, seven and five categories of classification, respectively. Owing to limitations on space, the protein names in the datasets 12, 7 and 5 are not given here, but they are available upon request.
Prediction algorithm |
---|
Suppose there are N proteins forming a set S, which is the union of m subsets, i.e.
$$S = G_1 \cup G_2 \cup \cdots \cup G_m \qquad (1)$$
The size of each subset is given by $n_\xi$ ($\xi = 1, 2, \ldots, m$), where $n_\xi$ represents the number of proteins in the subset $G_\xi$. Obviously, $N = \sum_{\xi=1}^{m} n_\xi$. For example, for the dataset in Appendix A, we have $m = 12$, $n_1 = 154$, $n_2 = 592$, ..., $n_{11} = 758$, $n_{12} = 25$ and $N = 2319$.
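The prediction algorithm is established based on the correlation between the subcellular location of a protein and its amino acid composition. Suppose the 20 amino acids are ordered alphabetically according to their single-letter codes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. Thus, any protein in S will correspond to a vector or a point in the 20-D (dimensional) space, i.e. it can be described by (K.C.Chou, 1995)
$$\mathbf{X}_k^\xi = \begin{bmatrix} x_{k,1}^\xi \\ x_{k,2}^\xi \\ \vdots \\ x_{k,20}^\xi \end{bmatrix} \quad (k = 1, 2, \ldots, n_\xi;\ \xi = 1, 2, \ldots, m) \qquad (2)$$
where $x_{k,1}^\xi, x_{k,2}^\xi, \ldots, x_{k,20}^\xi$ are the normalized occurrence frequencies of the 20 amino acids in the kth protein $\mathbf{X}_k^\xi$ of the subset $G_\xi$. The standard vector for the subset $G_\xi$ is defined by
$$\bar{\mathbf{X}}^\xi = \begin{bmatrix} \bar{x}_1^\xi \\ \bar{x}_2^\xi \\ \vdots \\ \bar{x}_{20}^\xi \end{bmatrix} \qquad (3)$$
where
$$\bar{x}_i^\xi = \frac{1}{n_\xi} \sum_{k=1}^{n_\xi} x_{k,i}^\xi \quad (i = 1, 2, \ldots, 20) \qquad (4)$$
Suppose X is a protein whose cellular location is to be predicted. It can be either one of the N proteins in the set S or a protein outside it. It also corresponds to a point $(x_1, x_2, \ldots, x_{20})$ in the 20-D space, where $x_i$ has the same meaning as $x_{k,i}^\xi$ but is associated with protein X instead of $\mathbf{X}_k^\xi$. Hence, the current algorithm can be formulated as follows.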
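As an illustration of Equations 2 to 4, the following minimal Python sketch computes a composition vector and a standard vector. The function names and the two toy sequences are hypothetical, and skipping non-standard residue codes is an assumption made here, not something specified in the text.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical by single-letter code

def composition(sequence):
    """Return the 20 normalized occurrence frequencies of a protein sequence."""
    # Assumption: residues outside the 20 standard codes are ignored.
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

def standard_vector(compositions):
    """Component-wise mean of the composition vectors of one subset."""
    n = len(compositions)
    return [sum(vec[i] for vec in compositions) / n for i in range(20)]

# Toy subset of two made-up sequences (for illustration only)
subset = [composition("ACDA"), composition("AAGG")]
mean_vec = standard_vector(subset)
```

Because every composition vector sums to 1, the standard vector of a subset sums to 1 as well.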
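The similarity between the standard vector $\bar{\mathbf{X}}^\xi$ and the protein X is characterized by the covariant discriminant, as defined by Liu and Chou (1998):
$$F(\mathbf{X}, \bar{\mathbf{X}}^\xi) = D_{\mathrm{Mah}}^2(\mathbf{X}, \bar{\mathbf{X}}^\xi) + \ln\left(\lambda_2^\xi \lambda_3^\xi \cdots \lambda_{20}^\xi\right) \qquad (5)$$
where the first term is the squared Mahalanobis distance between $\bar{\mathbf{X}}^\xi$ and X (Mahalanobis, 1936; Pillai, 1985; K.C.Chou, 1995):
$$D_{\mathrm{Mah}}^2(\mathbf{X}, \bar{\mathbf{X}}^\xi) = (\mathbf{X} - \bar{\mathbf{X}}^\xi)^{T} \mathbf{C}_\xi^{-1} (\mathbf{X} - \bar{\mathbf{X}}^\xi) \qquad (6)$$
where $\mathbf{C}_\xi$ is the covariance matrix for subset $G_\xi$, given by
$$\mathbf{C}_\xi = \begin{bmatrix} c_{1,1}^\xi & c_{1,2}^\xi & \cdots & c_{1,20}^\xi \\ c_{2,1}^\xi & c_{2,2}^\xi & \cdots & c_{2,20}^\xi \\ \vdots & \vdots & \ddots & \vdots \\ c_{20,1}^\xi & c_{20,2}^\xi & \cdots & c_{20,20}^\xi \end{bmatrix} \qquad (7)$$
the superscript T is the transposition operator and $\mathbf{C}_\xi^{-1}$ is the inverse matrix of $\mathbf{C}_\xi$. The matrix elements of $\mathbf{C}_\xi$ in Equation 7 are given by
$$c_{i,j}^\xi = \frac{1}{n_\xi - 1} \sum_{k=1}^{n_\xi} \left[x_{k,i}^\xi - \bar{x}_i^\xi\right]\left[x_{k,j}^\xi - \bar{x}_j^\xi\right] \quad (i, j = 1, 2, \ldots, 20) \qquad (8)$$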
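The covariance matrix of Equation 8 can be sketched with NumPy as follows. This is only an illustration: it assumes a subset is stored as an array with one composition vector per row, the function name is hypothetical, and random Dirichlet samples stand in for real compositions.

```python
import numpy as np

def subset_covariance(compositions):
    """Covariance matrix of one subset, using the n - 1 denominator.

    compositions: array of shape (n, 20), one composition vector per row.
    """
    data = np.asarray(compositions, dtype=float)
    n = data.shape[0]
    centered = data - data.mean(axis=0)   # subtract the standard vector
    return centered.T @ centered / (n - 1)

# Synthetic stand-in data: 50 random vectors whose components sum to 1
rng = np.random.default_rng(0)
toy_subset = rng.dirichlet(np.ones(20), size=50)
cov = subset_covariance(toy_subset)
```

With the n - 1 denominator this agrees with `np.cov(toy_subset, rowvar=False)`, which may be the more convenient call in practice.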
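Because the amino acid composition must be normalized, i.e. constrained by
$$\sum_{i=1}^{20} x_{k,i}^\xi = 1 \qquad (9)$$
we have (cf. Equation 8)
$$\sum_{i=1}^{20} c_{i,j}^\xi = \sum_{j=1}^{20} c_{i,j}^\xi = 0 \qquad (10)$$
Therefore, $\mathbf{C}_\xi$ defined by Equation 8 is a singular matrix, and its inverse $\mathbf{C}_\xi^{-1}$ diverges and is meaningless. One way to overcome this difficulty is to reduce the amino acid composition space from 20-D to 19-D by removing any one of its 20 components, as described by K.C.Chou (1995). Another way is to use an eigenvalue-eigenvector approach to calculate the Mahalanobis distance so as to avoid dealing with any inverse matrix. According to the eigenvalue-eigenvector approach (Chou and Zhang, 1995), Equation 6 can be written as
$$D_{\mathrm{Mah}}^2(\mathbf{X}, \bar{\mathbf{X}}^\xi) = \sum_{i=2}^{20} \frac{1}{\lambda_i^\xi} \left[\sum_{j=1}^{20} \left(x_j - \bar{x}_j^\xi\right) \phi_{i,j}^\xi\right]^2 \qquad (11)$$
where $\lambda_i^\xi$, the eigenvalue, and $\phi_{i,j}^\xi$, the jth component of the eigenvector $\boldsymbol{\Phi}_i^\xi$, are given by the following equation:
$$\mathbf{C}_\xi \boldsymbol{\Phi}_i^\xi = \lambda_i^\xi \boldsymbol{\Phi}_i^\xi \quad (i = 1, 2, \ldots, 20) \qquad (12)$$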
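The two ways around the singular covariance matrix can be compared numerically. The sketch below is illustrative only: the function name, the relative tolerance used to decide which eigenvalue counts as zero, and the synthetic Dirichlet data are assumptions, not part of the published method.

```python
import numpy as np

def mahalanobis_sq_eigen(x, mean, cov, tol=1e-10):
    """Squared Mahalanobis distance from the eigen-decomposition of cov,
    summing only over eigenvalues that are numerically positive."""
    eigvals, eigvecs = np.linalg.eigh(cov)      # cov is symmetric
    diff = np.asarray(x, dtype=float) - mean
    keep = eigvals > tol * eigvals.max()        # skip the (near-)zero eigenvalue
    projections = eigvecs[:, keep].T @ diff     # components of diff along eigenvectors
    return float(np.sum(projections ** 2 / eigvals[keep]))

# Synthetic stand-in data: 60 random compositions, each row summing to 1
rng = np.random.default_rng(1)
data = rng.dirichlet(np.ones(20), size=60)
mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)

row_sums = cov.sum(axis=1)                      # all close to 0, so cov is singular
d2_eigen = mahalanobis_sq_eigen(data[0], mean, cov)

# 19-D approach: drop the last component and invert the reduced matrix directly
diff19 = (data[0] - mean)[:19]
d2_19d = float(diff19 @ np.linalg.inv(cov[:19, :19]) @ diff19)
```

On this synthetic data the two values agree to within rounding error, which is what the equivalence of the two approaches leads one to expect.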
The second term of Equation 5 reflects the difference of covariance matrices for different subsets, in which $\lambda_i^\xi$ is the ith eigenvalue of the covariance matrix $\mathbf{C}_\xi$, as defined by Equation 12. It can be proved (Appendix B) that the covariance matrix $\mathbf{C}_\xi$ defined by Equation 8 has no negative eigenvalue. Actually, owing to Equation 10, $\mathbf{C}_\xi$ must have one eigenvalue, denoted by $\lambda_1^\xi$, equal to zero (Chou and Zhang, 1995); all the other 19 eigenvalues $\lambda_2^\xi, \lambda_3^\xi, \ldots, \lambda_{20}^\xi$ are generally greater than zero. Incorporation of the term $\ln\left(\lambda_2^\xi \lambda_3^\xi \cdots \lambda_{20}^\xi\right)$ into the discriminant function is important, especially when the subset sizes in the training dataset differ greatly (Chou et al., 1998). It is due to the second term that the covariant discriminant F as defined by Equation 5 is no longer a distance: it does not satisfy the condition $F(\mathbf{X}, \bar{\mathbf{X}}^\xi) = 0$ when $\mathbf{X} \equiv \bar{\mathbf{X}}^\xi$, and it may take a negative value, in conflict with the classical definition that a distance must satisfy positivity, symmetry and the triangular inequality. Accordingly, the prediction rule is formulated by
$$F(\mathbf{X}, \bar{\mathbf{X}}^\mu) = \mathrm{Min}\left\{F(\mathbf{X}, \bar{\mathbf{X}}^1), F(\mathbf{X}, \bar{\mathbf{X}}^2), \ldots, F(\mathbf{X}, \bar{\mathbf{X}}^m)\right\} \qquad (13)$$
where $\mu$ can be 1, 2, 3, ..., m, the operator Min means taking the least one among those in the braces, and the superscript $\mu$ is the subcellular location predicted for the protein X. If there is a tie, $\mu$ is not uniquely determined, but that did not occur in our datasets.
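Putting the two terms together, the discriminant and the minimum rule might be sketched as follows. The helper names, the class labels "A" and "B", the eigenvalue tolerance and the synthetic data are all hypothetical; on a tie, Python's `min` simply returns the first label encountered, which is a choice of this sketch only.

```python
import numpy as np

def fit_subset(compositions, tol=1e-10):
    """Standard vector, positive eigenvalues and their eigenvectors for one subset."""
    data = np.asarray(compositions, dtype=float)
    mean = data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(data, rowvar=False))
    keep = eigvals > tol * eigvals.max()        # drop the (near-)zero eigenvalue
    return mean, eigvals[keep], eigvecs[:, keep]

def covariant_discriminant(x, model):
    """Squared Mahalanobis distance plus ln of the product of positive eigenvalues."""
    mean, eigvals, eigvecs = model
    proj = eigvecs.T @ (np.asarray(x, dtype=float) - mean)
    return float(np.sum(proj ** 2 / eigvals) + np.sum(np.log(eigvals)))

def predict(x, models):
    """Return the label whose discriminant value is smallest."""
    return min(models, key=lambda label: covariant_discriminant(x, models[label]))

# Two synthetic subsets of unequal size (random data, hypothetical labels)
rng = np.random.default_rng(2)
alpha_a = np.ones(20)
alpha_a[0] = 8.0                                # subset "A": enriched in component 1
alpha_b = np.ones(20)
alpha_b[9] = 8.0                                # subset "B": enriched in component 10
models = {
    "A": fit_subset(rng.dirichlet(alpha_a, size=200)),
    "B": fit_subset(rng.dirichlet(alpha_b, size=40)),
}
query = alpha_a / alpha_a.sum()                 # composition at the centre of subset "A"
predicted = predict(query, models)
```

Here `predicted` comes out as "A", the subset whose centre the query was placed at. The discriminant values themselves can be negative, in line with the remark that F is not a distance.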
The eigenvalue-eigenvector approach and the 19-D space approach should give the same results. It is instructive to point out that, in the 19-D space approach, the covariant discriminant value defined by Equation 5 is the same regardless of which one of the 20 amino acid components is left out in constructing the 19-D space. This can be elucidated as follows. The covariant discriminant of Equation 5 consists of two terms. The first term is the squared Mahalanobis distance, whose invariance has already been proved by a theorem given by K.C.Chou (1995). The second term is a logarithm, and its argument is actually equal to the determinant of the matrix obtained by deleting the 20th row and 20th column from the matrix $\mathbf{C}_\xi$. As shown by Equation A17 of K.C.Chou (1995), such a determinant remains the same regardless of which row and column are removed from $\mathbf{C}_\xi$, as long as the removed row and column have the same index. This indicates the invariance of the second term, and hence also of the covariant discriminant of Equation 5.
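The invariance described above can be checked numerically. The following sketch uses synthetic Dirichlet data as a stand-in for real compositions; the helper name `reduced_terms` is hypothetical.

```python
import numpy as np

# Synthetic stand-in data: 80 random compositions, each row summing to 1
rng = np.random.default_rng(3)
data = rng.dirichlet(np.ones(20), size=80)
mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)
diff = data[0] - mean

def reduced_terms(drop):
    """Both terms of the discriminant in the 19-D space that omits component `drop`."""
    keep = [i for i in range(20) if i != drop]
    cov19 = cov[np.ix_(keep, keep)]             # delete the same row and column
    d19 = diff[keep]
    d2 = float(d19 @ np.linalg.solve(cov19, d19))
    _sign, logdet = np.linalg.slogdet(cov19)
    return d2, float(logdet)

terms = [reduced_terms(drop) for drop in range(20)]
d2_values = np.array([t[0] for t in terms])
logdet_values = np.array([t[1] for t in terms])
```

All 20 entries of `d2_values` coincide, and so do all 20 entries of `logdet_values`. In this sketch the reduced determinant equals the product of the 19 positive eigenvalues of the full matrix divided by 20; that constant factor shifts the discriminant of every class by the same amount, so it does not change which class attains the minimum.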
Results and discussion |
---|
Listed in Table II are the self-consistency test results for discriminating the 12 subcellular locations of proteins in the dataset S12 (Appendix A) by using the covariant discriminant algorithm (Equation 13) and the ProtLock algorithm (Cedano et al., 1997), respectively. For a detailed prediction process by the current algorithm, see Appendix C, where the covariant discriminant values calculated according to Equation 5 for the 37 proteins in the cytoskeleton subset and their predicted results are given as a demonstration. As can be seen from Table II, the overall rate of correct prediction by the current algorithm is 30% higher than that by the ProtLock algorithm (Cedano et al., 1997). Similar calculations were also carried out for the datasets S7 and S5. Furthermore, a jackknife test by the current algorithm and the ProtLock algorithm was performed for each of these three datasets. The results obtained are summarized in Table III, from which the following can be observed.
The current algorithm was also used to test the dataset studied by Nakai and Kanehisa (1991). From Gram-negative bacteria these authors extracted 106 proteins, of which 34 are inner membrane proteins, 21 periplasmic proteins, 22 outer membrane proteins and 29 cytoplasmic proteins (see Table 1 in Nakai and Kanehisa, 1991). According to their report, the self-consistency rate obtained by using the expert system to predict the localization sites of the 106 proteins was 83%. No cross-validation was performed in their study. For the same dataset, the corresponding rate was 85% when using the ProtLock algorithm (Cedano et al., 1997) but 99% when using the current algorithm, further indicating its power.
To demonstrate its power further, the current algorithm was also used to test the dataset recently studied by Reinhardt and Hubbard (1998). After discarding those groups in which the amount of data available was too small for statistical analysis, these authors classified 997 prokaryotic proteins into three subcellular locations: 688 cytoplasmic, 107 extracellular and 202 periplasmic proteins. Within each group, no protein had >90% sequence identity with any other. According to their report, the rate of correct prediction for this dataset by the neural network method in a subsampling test was 81%. This is the highest accuracy so far reported for a cross-validation test in protein cellular location prediction. For the same dataset, the discriminant function algorithm gave a rate of correct prediction of 91% by the self-consistency test and 86% by the jackknife test; both are considerably higher than 81%. Further, in their subsampling procedure, only a very small fraction of the possible divisions was investigated (Chou and Elrod, 1998), so the results thus obtained would certainly bear considerable arbitrariness. Actually, compared with the limited subsampling test, the jackknife test is much more objective and rigorous (Mardia et al., 1979). Accordingly, in terms of both the percentage of correct prediction and the rationality of cross-validation, a higher prediction quality can be obtained by using the current algorithm.
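The difference between the self-consistency test and the jackknife test can be made concrete with a short sketch. The two test functions below are generic; the classifier plugged into them is a deliberately simple stand-in (nearest class mean with Euclidean distance), not the covariant discriminant, and the toy points and labels are hypothetical.

```python
def jackknife_accuracy(samples, labels, fit, predict):
    """Leave each sample out in turn, retrain on the rest, predict the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def self_consistency_accuracy(samples, labels, fit, predict):
    """Train on all samples and predict the same samples back."""
    model = fit(samples, labels)
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)

# Stand-in classifier (NOT the covariant discriminant): nearest class mean
def fit_means(xs, ys):
    means = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = [sum(col) / len(members) for col in zip(*members)]
    return means

def predict_nearest(means, x):
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, means[label]))
    return min(means, key=sq_dist)

# Toy 2-D data with hypothetical labels
points = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [0.8, 0.9]]
tags = ["cyto", "cyto", "cyto", "peri", "peri", "peri"]
jk = jackknife_accuracy(points, tags, fit_means, predict_nearest)
sc = self_consistency_accuracy(points, tags, fit_means, predict_nearest)
```

In the jackknife loop the held-out protein never contributes to the parameters used to classify it, which is why its rate is usually lower than the self-consistency rate on real data; on this trivially separable toy set both happen to be 1.0.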
The current algorithm leads to the best prediction quality because it takes into account the coupling effect among different amino acid components, a kind of collective interaction, as formulated by the set of covariance matrices $\mathbf{C}_\xi$ ($\xi = 1, 2, \ldots, m$) in Equation 7, which is the core of the current algorithm. It is through each of these matrices that a more reasonable statistical distance (K.C.Chou, 1995; Chou and Zhang, 1995), the Mahalanobis distance, is defined in the amino acid composition space (see the first term of Equation 5), and it is through the eigenvalues of these matrices that the coupling effects in different subsets as well as their sizes are reflected (see the second term of Equation 5). It should be pointed out that although the ProtLock algorithm (Cedano et al., 1997) also contained a covariance matrix, it did not reflect the special character of each individual subset. In particular, a critical term, i.e. the second term of Equation 5, was missing entirely from the ProtLock algorithm. For a detailed discussion of this aspect, see Appendix D, where two important differences between the current algorithm and ProtLock are illustrated.
To show the difference in amino acid compositions that distinguish the subcellular locations of proteins, the 20-D standard vector derived from the proteins in the training dataset of Appendix A for each of the 12 subcellular locations is given in Table IV. Further, to provide an intuitive picture, each such 20-D standard vector is projected on to a 2-D radar diagram as given in Figure 2. In addition, the 19 positive eigenvalues for each of the 12 corresponding covariance matrices (see Equations 7 and 12) are given in Table V; these might be of use for investigating the component-coupled effects at a deeper level, especially for understanding the important contribution from the second term of Equation 5 as illustrated in Figure 3. This is a vitally important term for dealing with the case where the sizes of subsets are different. However, this term and the denominator $n_\xi - 1$ in Equation 8 were not included in the original least Mahalanobis distance algorithm (K.C.Chou, 1995), although good results were still obtained because the case studied there consisted of subsets of the same size. It is very important to realize this; otherwise the prediction algorithm might be misused, leading to poor results and an incorrect conclusion, as elaborated in a recent paper (Chou et al., 1998).
The idea of predicting the subcellular location of a protein according to its amino acid composition is based on the following rationale. (i) Different compartments of a cell usually have different physicochemical environments, which might be very selective in accommodating a protein according to its structural features, particularly the physical chemistry of its surface. (ii) The structural class of a protein, one of the most basic structural features, is correlated with its amino acid composition, as reflected by many encouraging reports of predicting the former based on the latter alone (see, e.g., P.Y.Chou, 1980; Klein and Delisi, 1986; Nakashima et al., 1986; K.C.Chou, 1995; Chou and Zhang, 1995; Bahar et al., 1997). (iii) The character of a protein surface, which is directly exposed to the environment of a cellular compartment, is also very likely correlated with the amino acid composition because it is determined by a sequence-folding process during which the interaction among different amino acid components might also play an important role. (iv) The above correlations suggest that the total amino acid composition might carry a "signal" that identifies the subcellular location. (v) Compared with the existing algorithms, the covariant discriminant algorithm proposed in this paper can give the best prediction quality for the protein subcellular location.
Appendix A
List of the 2319 proteins located in 12 different subcellular locations, with codes according to the SWISS-PROT data bank
Appendix B
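For the reader's convenience, let us prove that the covariance matrix $\mathbf{C}_\xi$ as defined by Equations 7 and 8 has no negative eigenvalues.
Suppose
$$\mathbf{A}_\xi = \mathbf{S}_\xi - \bar{\mathbf{X}}^\xi \mathbf{e}^{T} \qquad (B1)$$
where $\mathbf{S}_\xi$ is a $20 \times n_\xi$ matrix consisting of the $n_\xi$ vectors of Equation 2 and $\mathbf{e}$ is the $n_\xi$-dimensional column vector with all components equal to 1. Then we have
$$\mathbf{C}_\xi = \frac{1}{n_\xi - 1} \mathbf{A}_\xi \mathbf{A}_\xi^{T} \qquad (B2)$$
Suppose
$$\mathbf{y} = \left[y_1, y_2, \ldots, y_{20}\right]^{T} \qquad (B3)$$
is any real vector in the 20-D composition space. Left and right multiplying both sides of Equation B2 by $\mathbf{y}^{T}$ and $\mathbf{y}$, respectively, we obtain
$$\mathbf{y}^{T} \mathbf{C}_\xi \mathbf{y} = \frac{1}{n_\xi - 1} \left(\mathbf{A}_\xi^{T} \mathbf{y}\right)^{T} \left(\mathbf{A}_\xi^{T} \mathbf{y}\right) \geq 0 \qquad (B4)$$
Suppose $\boldsymbol{\Phi}^\xi$ is an eigenvector of $\mathbf{C}_\xi$, i.e.
$$\mathbf{C}_\xi \boldsymbol{\Phi}^\xi = \lambda^\xi \boldsymbol{\Phi}^\xi \qquad (B5)$$
where $\lambda^\xi$ is the corresponding eigenvalue. Left multiplying both sides of the above equation by $(\boldsymbol{\Phi}^\xi)^{T}$, we obtain
$$(\boldsymbol{\Phi}^\xi)^{T} \mathbf{C}_\xi \boldsymbol{\Phi}^\xi = \lambda^\xi (\boldsymbol{\Phi}^\xi)^{T} \boldsymbol{\Phi}^\xi \qquad (B6)$$
Because of Equation B4 and the fact that an eigenvector is a non-zero vector, it follows that
$$\lambda^\xi = \frac{(\boldsymbol{\Phi}^\xi)^{T} \mathbf{C}_\xi \boldsymbol{\Phi}^\xi}{(\boldsymbol{\Phi}^\xi)^{T} \boldsymbol{\Phi}^\xi} \geq 0 \qquad (B7)$$
This completes the proof.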
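The steps of the proof can be mirrored numerically. The sketch below is only a sanity check on synthetic data, not a substitute for the argument; the variable names and the random Dirichlet matrix standing in for the composition vectors are assumptions.

```python
import numpy as np

# Synthetic stand-in: a 20 x n matrix whose columns are composition vectors
rng = np.random.default_rng(4)
S = rng.dirichlet(np.ones(20), size=30).T
n = S.shape[1]
e = np.ones((n, 1))                             # column vector of ones

A = S - (S @ e / n) @ e.T                       # subtract the mean vector from every column
C = A @ A.T / (n - 1)                           # covariance matrix as a scaled Gram matrix

y = rng.normal(size=20)                         # an arbitrary real vector
quad = float(y @ C @ y)                         # quadratic form y^T C y
norm_form = float(np.sum((A.T @ y) ** 2) / (n - 1))   # same quantity as a squared norm

eigvals = np.linalg.eigvalsh(C)                 # eigenvalues in ascending order
```

Here `quad` and `norm_form` agree, and the smallest entry of `eigvals` is zero up to rounding error while the rest are positive, consistent with the conclusion of the proof.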
Appendix C
Covariant discriminant values computed according to Equation 5 for the 37 proteins in the cytoskeleton subset of the dataset S12 (see Appendix A) and the subcellular location predicted for each of these proteins according to Equation 13
Appendix D
Although the coupling effects among different amino acid components are taken into account by both the ProtLock algorithm (Cedano et al., 1997) and the current algorithm via a covariance matrix, there are two important differences between the two.
Difference in covariance matrix
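Rather than $\mathbf{C}_\xi$ as defined by Equations 7 and 8, the covariance matrix in the ProtLock algorithm was given by
$$\mathbf{C} = \begin{bmatrix} c_{1,1} & c_{1,2} & \cdots & c_{1,20} \\ c_{2,1} & c_{2,2} & \cdots & c_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ c_{20,1} & c_{20,2} & \cdots & c_{20,20} \end{bmatrix} \qquad (D1)$$
where
$$c_{i,j} = \sum_{k=1}^{N} \left[x_{k,i} - \bar{x}_i\right]\left[x_{k,j} - \bar{x}_j\right] \quad (i, j = 1, 2, \ldots, 20) \qquad (D2)$$
where
$$\bar{x}_i = \frac{1}{N} \sum_{k=1}^{N} x_{k,i} \qquad (D3)$$
with the sums running over all N proteins in the set S. Comparing Equation D1 with Equation 7, Equation D2 with Equation 8 and Equation D3 with Equation 4, one can easily see that there was only one covariance matrix $\mathbf{C}$ in ProtLock, defined for the entire set S, rather than each of the m subsets having its own covariance matrix $\mathbf{C}_\xi$. Accordingly, the Mahalanobis distance defined in ProtLock is a simplified form of the genuine Mahalanobis distance. This will certainly make the ProtLock algorithm lose some power in discriminating entries from different subsets.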
It is instructive to point out that the covariance matrix (Equation D1) given by Cedano et al. (1997) was defined in a 20-D space rather than the 19-D space originally formulated by K.C.Chou (1995). As mentioned in the prediction algorithm section, this would lead to a divergence difficulty when calculating the Mahalanobis distance in terms of the inverse matrix of $\mathbf{C}$, unless the user understood the use of the eigenvalue-eigenvector approach described in this paper to avoid such a difficulty.
Difference in discriminative criterion
The prediction in ProtLock was based on the Mahalanobis distance as defined by
$$D_{\mathrm{Mah}}^2(\mathbf{X}, \bar{\mathbf{X}}^\xi) = (\mathbf{X} - \bar{\mathbf{X}}^\xi)^{T} \mathbf{C}^{-1} (\mathbf{X} - \bar{\mathbf{X}}^\xi) \qquad (D4)$$
In contrast, the prediction in the current algorithm is based on the covariant discriminant function given by Equation 5. A comparison of Equation 5 with Equation D4 indicates that the contribution from the term $\ln\left(\lambda_2^\xi \lambda_3^\xi \cdots \lambda_{20}^\xi\right)$, which reflects the difference of the covariance matrices $\mathbf{C}_\xi$ for different classes, was completely ignored in the ProtLock algorithm. This further weakens its discriminative power.
Acknowledgments |
---|
Notes |
---|
References |
---|
Andrade,M.A., O'Donoghue,S.I. and Rost,B. (1998) J. Mol. Biol., 276, 517–525.
Bahar,I., Atilgan,A.R., Jernigan,R.L. and Erman,B. (1997) Proteins, 29, 172–185.
Bairoch,A. and Apweiler,R. (1997) Nucleic Acids Res., 25, 31–36.
Cedano,J., Aloy,P., Pérez-Pons,J.A. and Querol,E. (1997) J. Mol. Biol., 266, 594–600.
Chou,K.C. (1995) Proteins: Struct. Funct. Genet., 21, 319–344.
Chou,K.C. and Elrod,D.W. (1998) Biochem. Biophys. Res. Commun., 252, 63–68.
Chou,K.C. and Maggiora,G.M. (1998) Protein Engng, 11, 523–538.
Chou,K.C. and Zhang,C.T. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 275–349.
Chou,K.C., Liu,W., Maggiora,G.M. and Zhang,C.T. (1998) Proteins: Struct. Funct. Genet., 31, 97–103.
Chou,P.Y. (1980) In Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas.
Chou,P.Y. (1989) In Fasman,G.D. (ed.), Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, New York, pp. 549–586.
Claros,M.G., Brunak,S. and von Heijne,G. (1997) Curr. Opin. Struct. Biol., 7, 394–398.
Klein,P. and Delisi,C. (1986) Biopolymers, 25, 1569–1672.
Liu,W. and Chou,K.C. (1998) J. Protein Chem., 17, 209–217.
Lodish,H., Baltimore,D., Berk,A., Zipursky,S.L., Matsudaira,P. and Darnell,J. (1995) Molecular Cell Biology, 3rd edn. Scientific American Books, New York, Ch. 3.
Mahalanobis,P.C. (1936) Proc. Natl Inst. Sci. India, 2, 49–55.
Mardia,K.V., Kent,J.T. and Bibby,J.M. (1979) Multivariate Analysis. Academic Press, London, pp. 322 and 381.
Nakashima,H. and Nishikawa,K. (1994) J. Mol. Biol., 238, 54–61.
Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biochem., 99, 152–162.
Nakai,K. and Kanehisa,M. (1991) Proteins: Struct. Funct. Genet., 11, 95–110.
Nakai,K. and Kanehisa,M. (1992) Genomics, 14, 897–911.
Pillai,K.C.S. (1985) In Kotz,S. and Johnson,N.L. (eds), Encyclopedia of Statistical Sciences, Vol. 5. Wiley, New York, pp. 176–181.
Reinhardt,A. and Hubbard,T. (1998) Nucleic Acids Res., 26, 2230–2236.
Received July 28, 1998; revised October 16, 1998; accepted October 21, 1998.