Protein subcellular location prediction

Kuo-Chen Chou1 and David W. Elrod

Computer-Aided Drug Discovery, Pharmacia & Upjohn, Kalamazoo, MI 49007-4940, USA


    Abstract
 
The function of a protein is closely correlated with its subcellular location. With the rapid increase in new protein sequences entering into data banks, we are confronted with a challenge: is it possible to utilize a bioinformatic approach to help expedite the determination of protein subcellular locations? To explore this problem, proteins were classified, according to their subcellular locations, into the following 12 groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracell, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane and (12) vacuole. Based on the classification scheme that has covered almost all the organelles and subcellular compartments in an animal or plant cell, a covariant discriminant algorithm was proposed to predict the subcellular location of a query protein according to its amino acid composition. Results obtained through self-consistency, jackknife and independent dataset tests indicated that the rates of correct prediction by the current algorithm are significantly higher than those by the existing methods. It is anticipated that the classification scheme and concept and also the prediction algorithm can expedite the functionality determination of new proteins, which can also be of use in the prioritization of genes and proteins identified by genomic efforts as potential molecular targets for drug design.

Keywords: amino acid composition/bioinformatics/covariant discriminant/organelles/subcellular compartments


    Introduction
 
Given the sequence of a protein, how can its cellular location and biological function be determined? This is a problem vitally important to both cell biologists and bioinformaticians today. Since the number of sequences entering data banks has been increasing rapidly, it is time-consuming and costly to approach this problem entirely by performing various locational and functional experimental tests. For example, in release 35.0 (November 1997) of SWISS-PROT (Bairoch and Apweiler, 1997), the number of sequence entries reached 69 113, an increase of 17.10% over release 34.0 (October 1996). In view of this, it is highly desirable to develop an algorithm for rapidly predicting the subcellular compartments in which a new protein sequence could be located.

In a pioneering study, Nakashima and Nishikawa (1994) proposed an algorithm to discriminate between intracellular and extracellular proteins by amino acid composition and residue-pair frequencies. In their method, the training set consisted of 894 proteins, of which 649 were intracellular and 245 extracellular; the testing set consisted of 379 proteins, of which 225 were intracellular and 154 extracellular. Recently, Cedano et al. (1997) extended the discriminative classes from two to five, i.e. extracellular, integral membrane, anchored membrane, intracellular and nuclear. This represents remarkable progress in this area. Furthermore, in an attempt to improve the prediction quality of protein cellular location, they proposed an algorithm called ProtLock. The idea of predicting the cellular location of a protein according to its amino acid composition alone, as done in ProtLock, was stimulated by the encouraging results of structural class prediction, where the only input is also the amino acid composition (see, e.g., P.Y.Chou, 1980, 1989; Nakashima et al., 1986; K.C.Chou, 1995; Chou and Zhang, 1995). Analyses aimed at understanding the correlation of the structural class and of the subcellular location of a protein with its amino acid composition were recently given by Bahar et al. (1997) and Andrade et al. (1998), respectively.

Approaching the problem in a different way, Nakai and Kanehisa (1992) and Claros et al. (1997) proposed to predict the cellular location of proteins based on their N-terminal sorting signals. Obviously, these algorithms rely strongly on the existence of leader sequences. However, as pointed out recently by Reinhardt and Hubbard (1998), 'In large genome analysis projects genes are usually automatically assigned and these assignments are often unreliable for the 5'-regions'. 'This can lead to leader sequences being missing or only partially included, thereby causing problems for prediction algorithms depending on them'. Therefore, a method based on the amino acid composition would be more useful in practical applications.

As stated in the paper by Cedano et al. (1997), the ProtLock algorithm is mainly based on the procedure reported by Chou and Zhang (1995) for the prediction of protein structural classes according to Mahalanobis distances. The least Mahalanobis distance algorithm (K.C.Chou, 1995; Chou and Zhang, 1995) is valid only when the training subset sizes are the same or approximately the same; otherwise poor predictions will result (Chou et al., 1998; Chou and Maggiora, 1988). For this reason, in the ProtLock algorithm the training set for each class was chosen to contain the same number of proteins. However, as shown later, when the cellular protein classification is conducted at a deeper level, proteins located in some organelles are found to be much more abundant in the SWISS-PROT databank than those in others. For example, the number of proteins described as being located in the nucleus is much greater than that in the lysosome, and the number of proteins in the cytoplasm is much greater than that in the Golgi apparatus. Besides, for a real cell the number of cellular locations is much greater than the five considered by Cedano et al. (1997). In view of this, can we develop an algorithm to predict effectively the locations of proteins in cells at a much more discriminative level? The current study was initiated in an attempt to solve this problem.


    Location classification
 
According to their subcellular locations, proteins are classified into the following 12 discriminative groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracell, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane and (12) vacuole (Figure 1). Such a classification covers almost all the organelles in an animal or plant cell (see, e.g., Alberts et al., 1994; Lodish et al., 1995). Note that the vacuole and chloroplast exist only in a plant cell. Labels such as transmembrane and anchored-membrane describe protein types rather than subcellular locations. For example, a membrane protein can be associated with the membrane of the endoplasmic reticulum, Golgi apparatus, lysosome or any other organelle enveloped by a lipid bilayer. Therefore, if associated with the endoplasmic reticulum, the membrane protein is located at the endoplasmic reticulum; if associated with the Golgi apparatus, it is located at the Golgi apparatus; and so forth. Plasma membrane proteins are located at the cell envelope (Figure 1).



 
Fig. 1. Schematic diagram showing the subcellular locations of proteins. For simplification, indices 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 are used to represent chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracell, Golgi, lysosome, mitochondria, nucleus, peroxisome, plasma membrane and vacuole, respectively. Note that the vacuole and chloroplast proteins exist only in a plant cell.

 
The classification was based on release 35.0 of SWISS-PROT (Bairoch and Apweiler, 1997). In order to obtain a high-quality, well-defined training set, the data were screened strictly according to the following procedures:

  1. Only sequences with clear locational descriptions were included; those annotated with ambiguous or uncertain words such as 'location unspecified', 'probable', 'potential' and 'by similarity' were omitted.
  2. Sequences annotated with two or more locations were not included because of a lack of uniqueness. For example, a protein sequence labeled with 'Golgi and nuclear' or 'chloroplast or mitochondria' was omitted. Also note that secreted proteins were assigned to the extracellular group and proteins annotated with 'microtubule' or 'filament' were assigned to the cytoskeletal group (Alberts et al., 1994).
  3. For protein sequences with the same name but from different species, only one of them was included.

     After the above screening procedures we obtained a dataset, S12, of 12 categories that contains 2319 protein sequences, of which 154 are chloroplast proteins, 592 cytoplasmic, 37 cytoskeletal, 53 endoplasmic reticulum, 230 extracellular, 26 Golgi apparatus, 38 lysosomal, 86 mitochondrial, 288 nuclear, 32 peroxisomal, 758 plasma membrane and 25 vacuolar (column 2 of Table I).

     Table I. Breakdown of the datasets used in this study

  4. In order to observe the impact of the number of subcellular locations considered on the prediction rate, two more datasets were constructed. These two datasets are S7 and S5 (columns 4 and 6 of Table I, respectively), which were obtained by simply removing the small subsets from S12. The dataset S7 was derived from S12 by removing the cytoskeleton, Golgi apparatus, lysosome, peroxisome and vacuole subsets, none of which contains more than 50 proteins in S12. The dataset S5 was derived from S7 by further removing the endoplasmic reticulum and mitochondrial subsets, neither of which contains more than 100 proteins in S12.
  5. In order to test the consistency, three corresponding independent datasets were constructed, one each for the 12-, seven- and five-location classifications (columns 3, 5 and 7 of Table I, respectively); none of them contains a protein that occurs in the datasets S12, S7 and S5.
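
The bookkeeping behind S12, S7 and S5 can be checked with a few lines of Python. This is only a sketch: the subset sizes are those quoted above for S12, the variable and function names are illustrative, and the totals for S7 and S5 are derived here by subtraction rather than quoted from Table I.

```python
# Subset sizes of S12 as quoted in the text (column 2 of Table I).
S12 = {
    "chloroplast": 154, "cytoplasm": 592, "cytoskeleton": 37,
    "endoplasmic reticulum": 53, "extracell": 230, "Golgi apparatus": 26,
    "lysosome": 38, "mitochondria": 86, "nucleus": 288,
    "peroxisome": 32, "plasma membrane": 758, "vacuole": 25,
}

# Subsets removed at each step, as described in item 4 above.
REMOVED_FOR_S7 = {"cytoskeleton", "Golgi apparatus", "lysosome",
                  "peroxisome", "vacuole"}
REMOVED_FOR_S5 = {"endoplasmic reticulum", "mitochondria"}


def drop(dataset, names):
    """Return a copy of `dataset` without the subsets listed in `names`."""
    return {k: v for k, v in dataset.items() if k not in names}


S7 = drop(S12, REMOVED_FOR_S7)
S5 = drop(S7, REMOVED_FOR_S5)

for label, data in (("S12", S12), ("S7", S7), ("S5", S5)):
    print(label, len(data), "subsets,", sum(data.values()), "proteins")
```

Keeping the sizes in one mapping makes it easy to see how unbalanced the subsets are, which is the situation the prediction algorithm is designed to handle.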

For the convenience of further study or practical application, the names of the 2319 proteins in S12 are listed in Appendix A, from which the datasets S7 and S5 can also be easily obtained. In this study, the datasets S12, S7 and S5 were used as the training datasets to predict the subcellular location of a protein among the 12, seven and five categories of classification, respectively. Owing to limitations on space, the protein names in the three independent datasets are not given here, but they are available upon request.


    Prediction algorithm
 
For brevity, let us use indices 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 to represent chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracell, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane and vacuole, respectively. We use G1 to represent the chloroplast subset consisting of only chloroplast proteins, G2 to represent the cytoplasm subset consisting of only cytoplasmic proteins, and so forth.

Suppose there are N proteins forming a set S, which is the union of m subsets, i.e.

\[ S = G_1 \cup G_2 \cup \cdots \cup G_m \tag{1} \]

The size of each subset is given by $n_\xi$ ($\xi = 1, 2, \ldots, m$), where $n_\xi$ represents the number of proteins in the subset $G_\xi$. Obviously, $N = \sum_{\xi=1}^{m} n_\xi$. For example, for the dataset in Appendix A, we have $m = 12$, $n_1 = 154$, $n_2 = 592$, . . ., $n_{11} = 758$, $n_{12} = 25$ and $N = 2319$.

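The prediction algorithm is established based on the correlation between the subcellular location of a protein and its amino acid composition. Suppose the 20 amino acids are ordered alphabetically according to their single-letter codes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y. Thus, any protein in S will correspond to a vector or a point in the 20-D (dimensional) space, i.e. it can be described by (K.C.Chou, 1995)

\[ X_k^\xi = \left( x_{k,1}^\xi, x_{k,2}^\xi, \ldots, x_{k,20}^\xi \right)^{T} \quad (k = 1, 2, \ldots, n_\xi;\ \xi = 1, 2, \ldots, m) \tag{2} \]

where $x_{k,1}^\xi, x_{k,2}^\xi, \ldots, x_{k,20}^\xi$ are the normalized occurrence frequencies of the 20 amino acids in the kth protein of the subset $G_\xi$. The standard vector for the subset $G_\xi$ is defined by

\[ \bar{X}^\xi = \left( \bar{x}_1^\xi, \bar{x}_2^\xi, \ldots, \bar{x}_{20}^\xi \right)^{T} \tag{3} \]

where

\[ \bar{x}_i^\xi = \frac{1}{n_\xi} \sum_{k=1}^{n_\xi} x_{k,i}^\xi \quad (i = 1, 2, \ldots, 20) \tag{4} \]
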
Suppose X is a protein whose cellular location is to be predicted. It can be either one of the N proteins in the set S or a protein outside it. It also corresponds to a point $(x_1, x_2, \ldots, x_{20})$ in the 20-D space, where $x_i$ has the same meaning as $x_{k,i}^\xi$ but is associated with protein X instead of $X_k^\xi$. Hence, the current algorithm can be formulated as follows.

The similarity between the standard vector $\bar{X}^\xi$ and the protein X is characterized by the covariant discriminant, as defined by Liu and Chou (1998):

\[ F(X, \bar{X}^\xi) = D^2_{\mathrm{Mah}}(X, \bar{X}^\xi) + \ln\left( \lambda_2^\xi \lambda_3^\xi \cdots \lambda_{20}^\xi \right) \tag{5} \]

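where the first term is the squared Mahalanobis distance between $\bar{X}^\xi$ and X (Mahalanobis, 1936; Pillai, 1985; K.C.Chou, 1995):

\[ D^2_{\mathrm{Mah}}(X, \bar{X}^\xi) = (X - \bar{X}^\xi)^{T} C_\xi^{-1} (X - \bar{X}^\xi) \tag{6} \]

where $C_\xi$ is the covariance matrix for subset $G_\xi$, given by

\[ C_\xi = \begin{bmatrix} c_{1,1}^\xi & c_{1,2}^\xi & \cdots & c_{1,20}^\xi \\ c_{2,1}^\xi & c_{2,2}^\xi & \cdots & c_{2,20}^\xi \\ \vdots & \vdots & \ddots & \vdots \\ c_{20,1}^\xi & c_{20,2}^\xi & \cdots & c_{20,20}^\xi \end{bmatrix} \tag{7} \]

the superscript T is the transposition operator and $C_\xi^{-1}$ is the inverse matrix of $C_\xi$. The matrix elements $c_{i,j}^\xi$ in Equation 7 are given by

\[ c_{i,j}^\xi = \frac{1}{n_\xi - 1} \sum_{k=1}^{n_\xi} \left( x_{k,i}^\xi - \bar{x}_i^\xi \right) \left( x_{k,j}^\xi - \bar{x}_j^\xi \right) \quad (i, j = 1, 2, \ldots, 20) \tag{8} \]

Because the amino acid composition must be normalized, i.e. constrained by

\[ \sum_{i=1}^{20} x_{k,i}^\xi = 1 \tag{9} \]

we have (cf. Equation 8)

\[ \sum_{i=1}^{20} c_{i,j}^\xi = \sum_{j=1}^{20} c_{i,j}^\xi = 0 \tag{10} \]
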
Therefore, $C_\xi$ defined by Equation 8 is a singular matrix, so its inverse does not exist and any attempt to compute it diverges. One way to overcome this difficulty is to reduce the amino acid composition space from 20-D to 19-D by removing any one of its 20 components, as described by K.C.Chou (1995). Another way is to use an eigenvalue–eigenvector approach to calculate the Mahalanobis distance so as to avoid dealing with any inverse matrix. According to the eigenvalue–eigenvector approach (Chou and Zhang, 1995), Equation 6 can be written as

\[ D^2_{\mathrm{Mah}}(X, \bar{X}^\xi) = \sum_{i=2}^{20} \frac{1}{\lambda_i^\xi} \left[ \sum_{j=1}^{20} \left( x_j - \bar{x}_j^\xi \right) \psi_{i,j}^\xi \right]^2 \tag{11} \]

where $\lambda_i^\xi$, the eigenvalue, and $\psi_{i,j}^\xi$, the jth component of the eigenvector $\Psi_i^\xi$, are given by the following equation:

\[ C_\xi \Psi_i^\xi = \lambda_i^\xi \Psi_i^\xi \tag{12} \]

The second term of Equation 5 reflects the difference of covariance matrices for different subsets, in which $\lambda_i^\xi$ is the ith eigenvalue of the covariance matrix $C_\xi$, as defined by Equation 12. It can be proved (Appendix B) that the covariance matrix $C_\xi$ as defined by Equation 8 has no negative eigenvalue. Actually, owing to Equation 10, $C_\xi$ must have one eigenvalue, denoted by $\lambda_1^\xi$, equal to zero (Chou and Zhang, 1995); all the other 19 eigenvalues are generally greater than zero. Incorporation of the term $\ln(\lambda_2^\xi \lambda_3^\xi \cdots \lambda_{20}^\xi)$ into the discriminant function is important, especially when the subset sizes in the training dataset differ greatly (Chou et al., 1998). Because of this second term, the covariant discriminant F as defined by Equation 5 is no longer a distance: it does not satisfy the condition $F(X, \bar{X}^\xi) = 0$ when $X = \bar{X}^\xi$, and it may take a negative value, in conflict with the classical definition that a distance must satisfy positivity, symmetry and the triangle inequality. Accordingly, the prediction rule is formulated by

\[ F(X, \bar{X}^\lambda) = \mathrm{Min}\left\{ F(X, \bar{X}^1), F(X, \bar{X}^2), \ldots, F(X, \bar{X}^m) \right\} \tag{13} \]

where $\lambda$ can be 1, 2, 3, . . ., m, the operator Min means taking the least one among those in the braces, and the superscript $\lambda$ is the subcellular location predicted for the protein X. If there is a tie, $\lambda$ is not uniquely determined, but that did not occur in our datasets.
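
A minimal Python sketch of the prediction rule follows, under stated assumptions. It uses the 19-D route mentioned above (drop one component, then work with the reduced covariance matrix) rather than the eigenvalue–eigenvector form, and it takes the log-determinant of the reduced matrix as the second term, on the assumption that this differs from the logarithm of the product of the non-zero eigenvalues by at most a constant shared by all subsets, so that the minimum is unaffected. The function names, the Cholesky-based solver and the three-letter toy data are illustrative choices, not part of the published method.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical single-letter codes


def composition(sequence, alphabet=AMINO_ACIDS):
    """Normalized occurrence frequencies of the alphabet letters in a sequence."""
    counts = [sequence.count(a) for a in alphabet]
    total = sum(counts)
    return [c / total for c in counts]


def mean_vector(rows):
    """Standard vector of a subset: component-wise mean (cf. Equation 4)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows, mean):
    """Covariance matrix with the n - 1 denominator (cf. Equation 8)."""
    n, d = len(rows), len(mean)
    dev = [[r[i] - mean[i] for i in range(d)] for r in rows]
    return [[sum(v[i] * v[j] for v in dev) / (n - 1) for j in range(d)]
            for i in range(d)]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def covariant_discriminant(x, rows):
    """Squared Mahalanobis distance plus log-determinant in the reduced space."""
    reduced = [r[:-1] for r in rows]            # drop the last component
    mean = mean_vector(reduced)
    low = cholesky(covariance(reduced, mean))
    diff = [xi - mi for xi, mi in zip(x[:-1], mean)]
    # Forward substitution: solve low * y = diff, so that D^2 = y . y
    y = []
    for i, row in enumerate(low):
        y.append((diff[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    d2 = sum(v * v for v in y)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(len(low)))
    return d2 + log_det


def predict(x, training):
    """Label of the subset with the smallest covariant discriminant."""
    return min(training,
               key=lambda label: covariant_discriminant(x, training[label]))


# Hypothetical 3-component "compositions"; each row sums to 1.
training = {
    "loc1": [[0.55, 0.15, 0.30], [0.65, 0.15, 0.20],
             [0.55, 0.25, 0.20], [0.65, 0.25, 0.10]],
    "loc2": [[0.15, 0.55, 0.30], [0.25, 0.55, 0.20],
             [0.15, 0.65, 0.20], [0.25, 0.65, 0.10]],
}
print(predict([0.58, 0.22, 0.20], training))
print(predict(composition("ABBBC", alphabet="ABC"), training))
```

With real data each subset needs at least 20 proteins whose reduced composition vectors span the 19-D space; otherwise the reduced covariance matrix is itself singular and the Cholesky step fails.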

The eigenvalue–eigenvector approach and the 19-D space approach should give the same results. It is instructive to point out that, if using the 19-D space approach, the covariant discriminant value as defined by Equation 5 will be the same regardless of which one of the 20 amino acid components is left out for constructing the 19-D space. This can be elucidated as follows. The covariant discriminant of Equation 5 consists of two terms. The first term is the squared Mahalanobis distance and its invariability has already been proved by a theorem given by K.C.Chou (1995). The second term is a logarithm, and its argument is equal, apart from a constant factor that is the same for every subset and so cannot change the prediction, to the determinant of the matrix obtained by deleting the 20th row and 20th column from the matrix $C_\xi$. As shown by Equation A17 of K.C.Chou (1995), such a determinant would remain the same regardless of which row and column were removed from $C_\xi$, as long as the removed row and column have the same index. This indicates the invariability of the second term, and hence also the invariability of the covariant discriminant of Equation 5.
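
The invariance described above can be checked numerically on a small case. The sketch below is an illustration only: it uses hypothetical 3-component compositions (so the reduced space is 2-D and the inverse and determinant have closed forms), and the function name `discriminant_dropping` is an assumption of this example.

```python
import math


def discriminant_dropping(x, rows, drop):
    """Covariant discriminant for 3-component data after removing one component."""
    keep = [i for i in range(3) if i != drop]
    pts = [[r[i] for i in keep] for r in rows]
    q = [x[i] for i in keep]
    n = len(pts)
    mean = [sum(p[i] for p in pts) / n for i in range(2)]
    # 2 x 2 covariance matrix with the n - 1 denominator
    c = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pts) / (n - 1)
          for j in range(2)] for i in range(2)]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    d0, d1 = q[0] - mean[0], q[1] - mean[1]
    # d^T C^{-1} d written out with the closed-form inverse of a 2 x 2 matrix
    d2 = (c[1][1] * d0 * d0 - 2.0 * c[0][1] * d0 * d1 + c[0][0] * d1 * d1) / det
    return d2 + math.log(det)


# Hypothetical training rows and query; every vector sums to 1.
rows = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4]]
query = [0.45, 0.35, 0.20]

values = [discriminant_dropping(query, rows, drop) for drop in range(3)]
print(values)  # the three values agree up to rounding error
```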


    Results and discussion
 
The prediction quality was examined by two test methods, the self-consistency test and the jackknife test. In the self-consistency test, the subcellular location for each of the proteins in a given dataset was predicted using the rules derived from the same dataset, the so-called development dataset or training dataset. In the jackknife test, each protein in the training dataset was singled out in turn as a 'test protein' and all the rule parameters were determined from the remaining N − 1 proteins. The jackknife test is regarded as one of the most effective and objective methods for cross-validation in statistics (Mardia et al., 1979).
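
The two test procedures can be sketched in Python as follows. To keep the example short and checkable by hand, a nearest-mean rule on 1-D values stands in for the covariant discriminant; that stand-in, the function names and the toy data are assumptions of this sketch, not the method or data of this study.

```python
def nearest_mean(training, x):
    """Stand-in classifier: label whose subset mean is closest to x (1-D toy)."""
    means = {label: sum(v) / len(v) for label, v in training.items()}
    return min(means, key=lambda label: abs(x - means[label]))


def self_consistency(data, classify):
    """Predict every sample with rules derived from the full dataset."""
    hits = total = 0
    for label, samples in data.items():
        for x in samples:
            hits += classify(data, x) == label
            total += 1
    return hits / total


def jackknife(data, classify):
    """Predict every sample with rules derived from the remaining N - 1 samples."""
    hits = total = 0
    for label, samples in data.items():
        for k, x in enumerate(samples):
            rest = dict(data)
            rest[label] = samples[:k] + samples[k + 1:]  # leave sample k out
            hits += classify(rest, x) == label
            total += 1
    return hits / total


toy = {"A": [0.0, 0.1, 0.2, 0.62], "B": [1.0, 1.1, 1.2]}
print(self_consistency(toy, nearest_mean))  # 1.0: every sample is recovered
print(jackknife(toy, nearest_mean))         # 6/7: the outlier 0.62 is missed
```

The toy case shows why the jackknife rate is normally the lower of the two: a sample that helped define its own subset is easier to recover than one that was held out.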

Listed in Table II are the self-consistency test results for discriminating the 12 subcellular locations of proteins in the dataset S12 (Appendix A) by using the covariant discriminant algorithm (Equation 13) and the ProtLock algorithm (Cedano et al., 1997), respectively. For a detailed prediction process by the current algorithm, see Appendix C, where the covariant discriminant values calculated according to Equation 5 for the 37 proteins in the cytoskeleton subset and their predicted results are given as a demonstration. As can be seen from Table II, the overall rate of correct prediction by the current algorithm is 30% higher than that by the ProtLock algorithm (Cedano et al., 1997). Similar calculations were also carried out for the datasets S7 and S5. Furthermore, a jackknife test by the current algorithm and the ProtLock algorithm was performed for each of these three datasets. The results obtained are summarized in Table III, from which the following can be observed.


 
Table II. Self-consistency test results for the 2319 proteins in Appendix A

 

 
Table III. Overall rates of correct prediction by self-consistency, jackknife and independent dataset tests

 
  1. The overall rates of correct prediction obtained by the current algorithm using the jackknife and self-consistency tests for dataset S12 were 68.4 and 79.9%, respectively. For comparison, if the proteins were assigned completely at random among m possible subsets, the rate of correct assignment would generally be 1/m; if the random assignment were weighted according to the sizes of the subsets, the rate of correct prediction would be $\sum_{\xi=1}^{m} p_\xi^2$, where

     \[ p_\xi = \frac{n_\xi}{N} \]

     Hence the correct rate by a completely random assignment for a classification of 12 categories would be 1/12 ≈ 8.3%, and the corresponding rate by the weighted random assignment would be about 20.5%, provided one uses the number of proteins in each subcellular location as given in Appendix A to represent the size of each subset. Therefore, the rates of correct prediction obtained by using the covariant discriminant algorithm in both the self-consistency and jackknife tests are much higher than the corresponding completely randomized rate and weighted randomized rate, implying that the cellular location of a protein is considerably correlated with its amino acid composition.

  2. When the number of subcellular locations considered was reduced from 12 (S12) to seven (S7) and five (S5) by excluding small subsets (see Table I), the corresponding jackknife and self-consistency rates increased to 73.1 and 80.0% for S7 and to 78.3 and 83.1% for S5. This indicates that the prediction quality can be substantially improved if one can (i) narrow down the scope of subcellular location for a query protein according to its source and other relevant information (e.g. if a query protein is from an animal organism, one can exclude the chloroplast and vacuole subsets from consideration and the prediction will be made among 10 possible subcellular locations instead of 12); and (ii) improve the training data of small subsets by adding to them new proteins that have been found to belong to the locations defined by these subsets.
  3. As a demonstration of a practical application, predictions were also performed for the three independent datasets corresponding to S12, S7 and S5, using the rule parameters derived from the datasets S12, S7 and S5, respectively. The overall rates of correct prediction thus obtained are also given in Table III, from which it can be seen that the rates of correct prediction by the current algorithm are in the range 75.9–81.8%, fully consistent with the results obtained by the self-consistency and jackknife tests.
  4. No matter whether the self-consistency test, the jackknife test or the independent dataset test is used, the overall rates of correct prediction obtained by the current algorithm are significantly higher than those obtained by the ProtLock algorithm (Cedano et al., 1997). For the case of five subcellular locations, the rates of correct prediction by the current algorithm are 8.8–13.6% higher, for seven subcellular locations 17.5–26.8% higher and for 12 subcellular locations 24.5–35.9% higher. The above data also clearly indicate that the greater the number of subcellular locations considered, the more significant the improvement in prediction quality obtained by using the current algorithm. In other words, the covariant discriminant algorithm is particularly powerful when used to deal with a classification with many possible categories.
  5. The comparison of prediction quality was also extended to cover other algorithms, such as the least city-block distance algorithm (P.Y.Chou, 1980, 1989) and the least Euclidean distance algorithm (Nakashima et al., 1986). Both of these algorithms were developed for predicting the structural class of a protein according to its amino acid composition, and hence can be directly applied to predicting the protein subcellular locations based on the same datasets as used here. It was found that for the case of 12 subcellular locations, the overall rates of correct prediction by using the least city-block distance algorithm (P.Y.Chou, 1980, 1989) for the self-consistency, jackknife and independent dataset tests were 47.9, 46.4 and 45.4%, respectively, and the corresponding rates by the least Euclidean distance algorithm (Nakashima et al., 1986) were 48.1, 46.7 and 46.6%. Compared with these results, the overall rates of correct prediction by using the current algorithm are about 22–32% higher.
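
The two chance-level rates discussed in item 1 of the list above can be reproduced from the S12 subset sizes. This is a sketch under one assumption: that "weighted according to the sizes of subsets" means a protein is assigned to subset ξ with probability n_ξ/N, so that the expected rate of correct assignment is the sum of the squared fractions. The variable names are illustrative.

```python
# Subset sizes n_1 ... n_12 of S12, in the index order used in the text.
sizes = [154, 592, 37, 53, 230, 26, 38, 86, 288, 32, 758, 25]
N = sum(sizes)
m = len(sizes)

# Completely random assignment: every subset is equally likely.
uniform_rate = 1.0 / m

# Size-weighted random assignment: subset xi is chosen with probability
# n_xi / N, and the guess is correct with that same probability.
weighted_rate = sum((n / N) ** 2 for n in sizes)

print(f"N = {N}, m = {m}")
print(f"completely random:    {100 * uniform_rate:.1f}%")
print(f"size-weighted random: {100 * weighted_rate:.1f}%")
```

Both baselines lie far below the 68.4% jackknife rate reported for S12, which is the point of the comparison.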

The current algorithm was also used to test the dataset studied by Nakai and Kanehisa (1991). From Gram-negative bacteria these authors extracted 106 proteins, of which 34 are inner membrane proteins, 21 periplasmic proteins, 22 outer membrane proteins and 29 cytoplasmic proteins (see Table 1 in Nakai and Kanehisa, 1991). According to their report, the self-consistency rate obtained by using their expert system to predict the localization sites of the 106 proteins was 83%. No cross-validation was performed in their study. For the same dataset, the corresponding rate obtained with the ProtLock algorithm (Cedano et al., 1997) was 85%, whereas the rate obtained with the current algorithm was 99%, further indicating its power.

To demonstrate its power further, the current algorithm was also used to test the dataset recently studied by Reinhardt and Hubbard (1998). After discarding those groups in which the amount of data available is too small for statistical analysis, these authors classified 997 prokaryotic proteins into three different subcellular locations: 688 cytoplasmic, 107 extracellular and 202 periplasmic proteins. Within each group none had >90% sequence identity with any other. According to their report, the rate of correct prediction they obtained for this dataset with a neural network method in a subsampling test was 81%. This is the highest accuracy so far reported for a cross-validation test in protein cellular location prediction. For the same dataset, when using the covariant discriminant algorithm to perform the prediction, we found that the rate of correct prediction was 91% by the self-consistency test and 86% by the jackknife test; both are considerably higher than 81%. Further, in their subsampling procedure, only a very small fraction of the possible divisions was investigated (Chou and Elrod, 1998), and the results thus obtained would certainly bear considerable arbitrariness. Compared with the limited subsampling test, the jackknife test is much more objective and rigorous (Mardia et al., 1979). Accordingly, in terms of both the percentage of correct prediction and the rationality of cross-validation, a higher prediction quality can be obtained by using the current algorithm.

The current algorithm leads to the best prediction quality because it takes into account the coupling effect among different amino acid components, which is a kind of collective interaction, as formulated by the set of covariance matrices $C_\xi$ ($\xi = 1, 2, \ldots, m$) of Equation 7 that forms the core of the current algorithm. It is through each of these matrices that a more reasonable statistical distance (K.C.Chou, 1995; Chou and Zhang, 1995), the Mahalanobis distance, is defined in the amino acid composition space (see the first term of Equation 5), and it is through the eigenvalues of these matrices that the coupling effects in different subsets as well as their sizes are reflected (see the second term of Equation 5). It should be pointed out that although the ProtLock algorithm (Cedano et al., 1997) also contained a covariance matrix, it did not reflect the special character of each of the individual subsets. In particular, a critical term, i.e. the second term of Equation 5, was completely missing from the ProtLock algorithm. For a detailed discussion of this aspect, see Appendix D, where two important differences between the current algorithm and ProtLock are illustrated.

To show the difference in amino acid compositions that distinguish the subcellular locations of proteins, the 20-D standard vector derived from the proteins in the training dataset of Appendix A for each of the 12 subcellular locations is given in Table IV. Further, to provide an intuitive picture, each such 20-D standard vector is projected on to a 2-D radar diagram as given in Figure 2. In addition, the 19 positive eigenvalues for each of the 12 corresponding covariance matrices (see Equations 7 and 12) are given in Table V; they might be of use for investigating the component-coupled effects at a deeper level, especially for understanding the important contribution from the second term of Equation 5 as illustrated in Figure 3. This is a vitally important term for dealing with the case where the sizes of subsets are different. However, this term and also the denominator $n_\xi - 1$ in Equation 8 were not included in the original least Mahalanobis distance algorithm (K.C.Chou, 1995), although good results were still obtained because the case studied there consisted of subsets with the same size. It is very important to realize this; otherwise the prediction algorithm might be misused, leading to poor results and an incorrect conclusion, as elaborated in a recent paper (Chou et al., 1998).


 
Table IV. The standard vector derived from the training dataset of Appendix A for each of the 12 protein subcellular locations

 


 
Fig. 2. Radar diagrams to show the difference of the 20-D standard vectors, i.e. the average amino acid compositions for the proteins in the following subcellular locations: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracell, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane and (12) vacuole. Amino acids are denoted by their single-letter codes (see Table IV).

 

 
Table V. The 19 positive eigenvalues of the covariance matrix derived from the training dataset of Appendix A for each of the 12 protein subcellular locations

 


 
Fig. 3. Histograms to show the contributions of $\ln(\lambda_2^\xi \lambda_3^\xi \lambda_4^\xi \cdots \lambda_{20}^\xi)$ from different subsets to the covariant discriminant function of Equation 5. As can be seen, the heights of the 12 histograms are considerably different. Only when the heights are the same can the second term of Equation 5 be omitted from the prediction algorithm.

 
Conclusion

The idea of predicting the subcellular location of a protein according to its amino acid composition is based on the following rationale. (i) Different compartments of a cell usually have different physicochemical environments, which might be very sensitive in selectively accommodating a protein according to its structural features, particularly the physical chemistry of its surface. (ii) The structural class of a protein, one of the most basic structural features, is correlated with its amino acid composition, as reflected by many encouraging reports of predicting the former based on the latter alone (see, e.g., P.Y.Chou, 1980; Klein and Delisi, 1986; Nakashima et al., 1986; K.C.Chou, 1995; Chou and Zhang, 1995; Bahar et al., 1997). (iii) The character of a protein surface, which is directly exposed to the environment of a cellular compartment, is also very likely correlated with the amino acid composition because it is determined by a sequence-folding process during which the interaction among different amino acid components might also play an important role. (iv) The above correlations suggest that the total amino acid composition might carry a 'signal' that identifies the subcellular location. (v) Compared with the existing algorithms, the covariant discriminant algorithm proposed in this paper can give the best prediction quality for the protein subcellular location.

Appendix A

List of the 2319 proteins located in 12 different subcellular locations, with codes according to the SWISS-PROT data bank

Appendix B

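For the reader's convenience, let us prove that the covariance matrix $C_\xi$ as defined by Equations 7 and 8 has no negative eigenvalues.

Suppose

\[ \Delta_\xi = S_\xi - \bar{X}^\xi e^{T} \tag{B1} \]

where $S_\xi$ is a matrix consisting of the $n_\xi$ vectors of Equation 2 and e is the $n_\xi$-dimensional column vector with all components equal to 1. Then we have

\[ C_\xi = \frac{1}{n_\xi - 1} \Delta_\xi \Delta_\xi^{T} \tag{B2} \]

Suppose

\[ y = (y_1, y_2, \ldots, y_{20})^{T} \tag{B3} \]

is any real vector in the 20-D composition space. Left and right multiplying both sides of Equation B2 by $y^{T}$ and y, respectively, we obtain

\[ y^{T} C_\xi y = \frac{1}{n_\xi - 1} \left( \Delta_\xi^{T} y \right)^{T} \left( \Delta_\xi^{T} y \right) \ge 0 \tag{B4} \]

Suppose $\Psi$ is an eigenvector of $C_\xi$, i.e.

\[ C_\xi \Psi = \lambda \Psi \tag{B5} \]

where $\lambda$ is the corresponding eigenvalue. Left multiplying both sides of the above equation by $\Psi^{T}$, we obtain

\[ \Psi^{T} C_\xi \Psi = \lambda \Psi^{T} \Psi \tag{B6} \]

Because of Equation B4 and the fact that an eigenvector is a non-zero vector, it follows that

\[ \lambda \ge 0 \tag{B7} \]

This completes the proof.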
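
The key step of the proof, that the quadratic form of a covariance matrix is a scaled sum of squares and therefore cannot be negative, can be checked numerically. The following Python sketch uses hypothetical 3-component compositions instead of 20-D vectors; the helper names are assumptions of this example.

```python
def covariance(rows):
    """Return the covariance matrix (n - 1 denominator) and the deviations."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[i] for r in rows) / n for i in range(d)]
    dev = [[r[i] - mean[i] for i in range(d)] for r in rows]
    cov = [[sum(v[i] * v[j] for v in dev) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, dev


def quadratic_form(cov, y):
    """y^T C y for a square matrix C given as nested lists."""
    d = len(y)
    return sum(y[i] * cov[i][j] * y[j] for i in range(d) for j in range(d))


# Hypothetical compositions; every row sums to 1.
rows = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.6, 0.1, 0.3], [0.3, 0.3, 0.4]]
cov, dev = covariance(rows)

for y in ([1.0, 0.0, 0.0], [1.0, -2.0, 0.5], [1.0, 1.0, 1.0]):
    lhs = quadratic_form(cov, y)
    # Sum of squared projections of the deviations on y, divided by n - 1
    rhs = sum(sum(v[i] * y[i] for i in range(3)) ** 2 for v in dev) / (len(rows) - 1)
    print(y, lhs, rhs)
```

The last test vector, with all components equal, gives a value of zero up to rounding: because every composition sums to 1, the deviations have no component along that direction, which is the zero eigenvalue discussed in the prediction algorithm section.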

Appendix C

Covariant discriminant values computed according to Equation 5 for the 37 proteins in the cytoskeleton subset of the dataset S12 (see Appendix A) and the subcellular location predicted for each of these proteins according to Equation 13

Appendix D

Although the coupling effects among different amino acid components are taken into account by both the ProtLock algorithm (Cedano et al., 1997) and the current algorithm via a covariance matrix, there are two important differences between the two.

Difference in covariance matrix. Rather than $C_\xi$ as defined by Equations 7 and 8, the covariance matrix in the ProtLock algorithm was given by


where


where


Comparing Equation D1 with Equation 7, Equation D2 with Equation 8 and Equation D3 with Equation 4, one can easily see that only one covariance matrix C, defined for the entire set S, was used in ProtLock, rather than each of the m subsets having its own covariance matrix C{xi}. Accordingly, the Mahalanobis distance defined in ProtLock is a simplified form of the genuine Mahalanobis distance. This makes the ProtLock algorithm lose some power in discriminating entries from different subsets.
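The effect of replacing the subset-specific matrices by a single matrix can be illustrated with a small numerical sketch. The example below uses NumPy and two hypothetical subsets in a 2-D toy space (not the 20-D composition space and not the data of this paper); the names `tight`, `broad` and `mahalanobis_sq` are illustrative assumptions.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of x from mean under covariance cov."""
    d = x - mean
    return float(d @ np.linalg.solve(cov, d))

rng = np.random.default_rng(0)
# Two hypothetical subsets with very different spreads.
tight = rng.normal([0.0, 0.0], 0.1, size=(200, 2))
broad = rng.normal([1.0, 1.0], 1.0, size=(200, 2))

cov_tight = np.cov(tight, rowvar=False)   # one covariance matrix per subset
cov_broad = np.cov(broad, rowvar=False)
cov_pooled = np.cov(np.vstack([tight, broad]), rowvar=False)  # one matrix for the entire set

x = np.array([0.5, 0.5])  # query point roughly midway between the two means
per_class = [mahalanobis_sq(x, tight.mean(axis=0), cov_tight),
             mahalanobis_sq(x, broad.mean(axis=0), cov_broad)]
pooled = [mahalanobis_sq(x, tight.mean(axis=0), cov_pooled),
          mahalanobis_sq(x, broad.mean(axis=0), cov_pooled)]
```

With subset-specific matrices the query is many standard deviations away from the tight subset but well inside the broad one, so `per_class` separates the two clearly. With the single pooled matrix the two entries of `pooled` are of similar size, and the information carried by the different spreads is lost.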

It is instructive to point out that the covariance matrix (Equation D1) given by Cedano et al. (1997) was defined in a 20-D space rather than the 19-D space originally formulated by K.C.Chou (1995). As mentioned in the prediction algorithm section, this leads to a divergence difficulty when the Mahalanobis distance is calculated from the inverse of C, unless the eigenvalue–eigenvector approach described in this paper is used to avoid it.
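The origin of this difficulty can be checked numerically: because the 20 composition components sum to 1, the 20-D covariance matrix has one zero eigenvalue and cannot be inverted directly. The following NumPy sketch uses randomly generated composition-like vectors (hypothetical data) and evaluates the distance in the eigenvector basis while skipping the null direction. It is one way to realize an eigenvalue–eigenvector treatment, assumed here for illustration, and is not necessarily identical in detail to the formulation in the prediction algorithm section.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 50 composition-like vectors whose 20 components sum to 1.
raw = rng.random((50, 20))
comps = raw / raw.sum(axis=1, keepdims=True)

mean = comps.mean(axis=0)
cov = np.cov(comps, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues returned in ascending order

def mahalanobis_sq_reduced(x, mean, eigvals, eigvecs, tol=1e-10):
    """Squared Mahalanobis distance using only eigen-directions with non-zero eigenvalue."""
    keep = eigvals > tol * eigvals.max()       # drops the null direction
    proj = eigvecs[:, keep].T @ (x - mean)     # coordinates along the kept eigenvectors
    return float(np.sum(proj ** 2 / eigvals[keep]))

query = comps[0]
d2 = mahalanobis_sq_reduced(query, mean, eigvals, eigvecs)
n_kept = int(np.sum(eigvals > 1e-10 * eigvals.max()))
```

Here `eigvals[0]` is zero to within rounding error and `n_kept` is 19, so the distance is effectively computed in a 19-D subspace.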

Difference in discriminative criterion. The prediction in ProtLock was based on the Mahalanobis distance as defined by


In contrast, the prediction in the current algorithm is based on the covariant discriminant function given by Equation 5. A comparison of Equation 5 with Equation D4 indicates that the contribution from the term ln|C{xi}|, which reflects the difference of the covariance matrices C{xi} for different classes, was completely ignored in the ProtLock algorithm. This will further weaken the discriminative power.
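The role of the additional term can be seen in a minimal sketch. It assumes that the covariant discriminant is the squared Mahalanobis distance plus the logarithm of the determinant of the subset covariance matrix and that the query is assigned to the subset with the smallest value. The two subsets below are hypothetical 2-D examples with non-singular covariance matrices, and the function names are illustrative.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of x from mean under covariance cov."""
    d = x - mean
    return float(d @ np.linalg.solve(cov, d))

def covariant_discriminant(x, mean, cov):
    """Assumed form: squared Mahalanobis distance plus ln|cov|."""
    _sign, logdet = np.linalg.slogdet(cov)  # numerically stable log-determinant
    return mahalanobis_sq(x, mean, cov) + float(logdet)

def predict(x, means, covs):
    """Return the index of the subset with the smallest discriminant value, and all values."""
    scores = [covariant_discriminant(x, m, c) for m, c in zip(means, covs)]
    return int(np.argmin(scores)), scores

# Two hypothetical subsets with the same mean but different spreads.
means = [np.zeros(2), np.zeros(2)]
covs = [0.01 * np.eye(2), np.eye(2)]
x = np.zeros(2)

label, scores = predict(x, means, covs)
distances = [mahalanobis_sq(x, m, c) for m, c in zip(means, covs)]
```

For this query both Mahalanobis distances are zero, so a criterion based on the distance alone cannot choose between the subsets, whereas the log-determinant term favours the more compact subset (index 0).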


    Acknowledgments
 
Valuable discussions with Professor Ferenc J.Kézdy and Dr Reqiang Yan are gratefully acknowledged. The authors are also indebted to Dr Viv Junker for interpreting the annotations in the Swiss Protein Data Bank, to Dr Nakai and Dr A.Reinhardt for providing their datasets for testing the covariant discriminant algorithm and to Raymond B.Moeller, Cynthia A.Ludlow and Diane M.Ulrich for drawing the figures.


    Notes
 
1 To whom correspondence should be addressed. E-mail: kuo-chen.chou@am.pnu.com


    References
 
Alberts,B., Bray,D., Lewis,J., Raff,M., Roberts,K. and Watson,J.D. (1994) Molecular Biology of the Cell, 3rd edn. Garland Publishing, New York, London, Ch. 1.

Andrade,M.A., O'Donoghue,S.I. and Rost,B. (1998) J. Mol. Biol., 276, 517–525.

Bahar,I., Atilgan,A.R., Jernigan,R.L. and Erman,B. (1997) Proteins, 29, 172–185.

Bairoch,A. and Apweiler,R. (1997) Nucleic Acids Res., 25, 31–36.

Cedano,J., Aloy,P., Pérez-Pons,J.A. and Querol,E. (1997) J. Mol. Biol., 266, 594–600.

Chou,K.C. (1995) Proteins: Struct. Funct. Genet., 21, 319–344.

Chou,K.C. and Elrod,D.W. (1998) Biochem. Biophys. Res. Commun., 252, 63–68.

Chou,K.C. and Maggiora,G.M. (1998) Protein Engng, 11, 523–538.

Chou,K.C. and Zhang,C.T. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 275–349.

Chou,K.C., Liu,W., Maggiora,G.M. and Zhang,C.T. (1998) Proteins: Struct. Funct. Genet., 31, 97–103.

Chou,P.Y. (1980) In Abstracts of Papers, Part I, Second Chemical Congress of the North American Continent, Las Vegas.

Chou,P.Y. (1989) In Fasman,G.D. (ed.), Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, New York, pp. 549–586.

Claros,M.G., Brunak,S. and von Heijne,G. (1997) Curr. Opin. Struct. Biol., 7, 394–398.

Klein,P. and Delisi,C. (1986) Biopolymers, 25, 1569–1672.

Liu,W. and Chou,K.C. (1998) J. Protein Chem., 17, 209–217.

Lodish,H., Baltimore,D., Berk,A., Zipursky,S.L., Matsudaira,P. and Darnell,J. (1995) Molecular Cell Biology, 3rd edn. Scientific American Books, New York, Ch. 3.

Mahalanobis,P.C. (1936) Proc. Natl Inst. Sci. India, 2, 49–55.

Mardia,K.V., Kent,J.T. and Bibby,J.M. (1979) Multivariate Analysis. Academic Press, London, pp. 322 and 381.

Nakashima,H. and Nishikawa,K. (1994) J. Mol. Biol., 238, 54–61.

Nakashima,H., Nishikawa,K. and Ooi,T. (1986) J. Biochem., 99, 152–162.

Nakai,K. and Kanehisa,M. (1991) Proteins: Struct. Funct. Genet., 11, 95–110.

Nakai,K. and Kanehisa,M. (1992) Genomics, 14, 897–911.

Pillai,K.C.S. (1985) In Kotz,S. and Johnson,N.L. (eds), Encyclopedia of Statistical Sciences, Vol. 5. Wiley, New York, pp. 176–181.

Reinhardt,A. and Hubbard,T. (1998) Nucleic Acids Res., 26, 2230–2236.

Received July 28, 1998; revised October 16, 1998; accepted October 21, 1998.