1Computer Science V, University of Mannheim, B6, 26, D-68131 Mannheim, Germany, 2Center of Excellence in Bioinformatics, State University of New York at Buffalo, 901 Washington Street, Buffalo, NY 14203-1199, USA, 3CRIBI Biotechnology Centre, University of Padua, V.le G. Colombo 3, I-35121 Padova, Italy and 4Chair for Computer Vision, Graphics and Pattern Recognition, University of Mannheim, L13, 19, D-68131 Mannheim, Germany
5 To whom correspondence should be addressed, at the address in Italy. e-mail: silvio{at}cribi.unipd.it
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: classification/enzyme code/protein fold recognition/secondary structure
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
In 1991, Bowie and co-workers developed a fold recognition method; instead of scoring the compatibility of a sequence by comparing it with a sequence of known structure, the sequence was compared with the structural information of a known three-dimensional structure. This method was termed structural 3D1D profile (Bowie et al., 1991). Since then, a variety of fold-recognition methods have been published (Godzik et al., 1992
; Jones et al., 1992
; Sippl and Weitckus, 1992
; Bryant and Lawrence, 1993
; Russel et al., 1996
; Karplus et al., 1998
; Jones, 1999a
; Thiele et al., 1999
; Fischer, 2000
; Kelley et al., 2000
; Rychlewski et al., 2000
) and several reviews on these methods have appeared (Sternberg et al., 1999
; Jones, 2000
).
With the availability of sensitive sequence alignment methods, the two-step PDB-BLAST protocol (Rychlewski et al., 2000) is increasingly used as the baseline method for defining easy and hard fold recognition targets. This protocol uses PSI-BLAST (Altschul et al., 1997
) to build a sequence profile from the non-redundant database of known sequences. The profile generated is then used in a second PSI-BLAST round to search for related sequences of known structure. It was shown to perform well during the LiveBench-2 experiment (Bjunicki et al., 2001
). Because PDB-BLAST is used to classify target difficulty, it is safe to say that for the prediction of more difficult fold recognition targets, sequence similiarity alone is unlikely to yield satisfactory results. This perception has also recently been emphasized by McGuffin and Jones in the context of discriminating novel and known folds (McGuffin and Jones, 2002
). They suggest that the use of secondary structure alignments is significantly more accurate at identifying novel folds than sequence-based methods. Owing to its speed it is also suitable for genome-wide analysis (McGuffin and Jones, 2002
).
It is widely accepted that protein function is related to structure (Bartlett et al., 2003). Several classification schemes have been proposed for protein function. Enzymes are generally easier to classify than non-enzymes because they catalyze distinct chemical reactions, which can be easily enumerated. The enzyme code scheme (NC-IUBMB, 1992
) is the best developed and most widely used functional classification scheme. Hegyi and Gerstein found statistically significant correlations between structural class and enzyme code (Hegyi and Gerstein, 1999
). The two most versatile folds had 29 (TIM barrel) and 17 (
/ß hydrolase fold) different (to the third level) enzyme codes associated with them. However, all other folds are restricted to five or fewer enzyme codes (Hegyi and Gerstein, 1999
; Bartlett et al., 2003
). This functional correlation can therefore be used in fold recognition to reduce the number of possible folds for proteins which have been previously characterized as enzymes.
Our system, called MANIFOLD (MANnheIm FOLD recognition), is a hybrid of different approaches. Instead of relying on sequence similarity alone, it also uses secondary structure and functional information to improve fold recognition. For a given target protein sequence, the program computes several different similarity scores, which represent the similarity in sequence, secondary structure and function, between the target protein and a template protein from the fold library. The new aspect of our system is the use of functional information about the query protein in form of an enzyme code. Also, as described below, we developed a non-linear ranking scheme in order to combine the scores of the three different similarity measures used.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
The program generates a set of features for each template structure of the fold library. In this study a vector of three features is used: (i) a score for a global alignment of the secondary structural elements of the target and template protein; (ii) the negative logarithm of the PSI-BLAST e-value; and (iii) the logarithm of our structurefunction compatibility score (described below). For each feature vector a score is computed, which in effect is a prediction of whether a particular template belongs to the same fold class as the target protein.
Each feature score is an input into a small two-layer neural network; the output of the neural network is the similarity score of the fold recognition program (see Ranking method for more details).
Using enzyme codes and structurefunction relationships
We use the enzyme code system (NC-IUBMB, 1992) for describing the function of a protein. It is a four-level hierarchy that classifies different aspects of chemical reactions catalyzed by enzymes. The first digit denotes the class of reaction: oxidoreductase, transferase, hydrolase, lyase, isomerase or ligase. The two subsequent digits classify the type of bond involved, cofactors and other specificities. The fourth digit identifies different substrates involved in the same general reaction. The function of non-enzymes is not modeled in this study (we represent the function of a non-enzyme using the enzyme code 0.0.0.0). Some functions have only been observed with a single fold class so far, but in some cases, proteins belonging to different fold classes can perform the same or a similar function. Given the enzyme code of a protein, what is its probability of having a certain structure?
In order to model the known structures and functions, we prepared a data file that contains annotated enzyme codes of all non-identical CATH codes [version 2.4 (Orengo et al., 1997)]. CATH domains belonging to the same homology class (H-level, fourth hierarchy level in CATH) as any member of the test set were eliminated. This parameter-file serves as our structurefunction database.
We used the supplementary data files provided by the SwissProt and the Enzyme database (Bairoch and Apweiler, 2000) in order to annotate the enzyme code of the proteins of the data sets. If the PDB code of a protein is associated with several different enzyme codes, the annotation was set to the status of an unknown function (modeled by enzyme code number 0.0.0.0).
The structurefunction score models the probability that the target protein with function ftar and template protein with function ftemp belong to the same fold class. This score is computed as follows. We define an a priori probability of a protein belonging to a certain fold class to be a constant pmin = 1/650. This assumes that all fold classes are equally likely if one knows nothing about the target protein and that all fold classes are equally populated. This assumption oversimplifies the uneven distribution of different fold classes of proteins in the PDB. However, the exact value of pmin has little influence on the overall score, because small values of the function scores are all scaled to a value close to zero with the non-linear scaling of the neural network (see Ranking method).
The enzyme codes of the template and the target are compared. If the similarity between the enzyme code is less than three identical levels, the algorithm stops and the similarity score is set to the logarithm of pmin. Otherwise, a set s of proteins with non-identical CATH codes are collected from the prepared structurefunction database, which are associated with the three or four (depending on the enzyme-code similarity between target and template) common enzyme code hierarchy levels. Let v be the number of non-identical CATH codes (four levels) and let nfold be the number of different folds (three levels for CATH) of that set s. As the resulting score of the algorithm we use a convex combination between a maximum and a minimum score:
xfunc = log(pfunc) = log[(1 a)xpmin + a/nfold]
The factor a describes how sparse the known structurefunction data are for that targettemplate pair. We set a = v/(v + m), with v being the number of different CATH codes (counting four levels) found for this enzyme code. In the limit of a perfect structurefunction relationship (i.e. the same enzyme code is always associated with the same fold class, many structures are known for proteins with this function), we obtain a 1, pfunc
1. In the limit of very sparse data or very diverse structurefunction relationships, we obtain a
0 or 1/nfold
0, hence pfunc
pmin. We set m = 3. The particular values of the constants m and pmin have little influence on the performance of the structurefunction scoring term.
PSI-BLAST
We also use the e-value computed with PSI-BLAST (Altschul et al., 1997) in order to have a sequence-based similarity measure. We are searching the non-redundant sequence database provided by the NCBI. The database was processed with a filter from the PSIPRED program suite (Jones, 1999b
) in order to remove low-complexity regions, trans-membrane regions and coiled-coil segments. The performed search consists of four rounds of PSI-Blast on the non-redundant nr database with standard parameters. The resulting position-specific scoring matrix is used to search the available sequences of the template library with BLAST. This protocol, called PDB-BLAST, improves fold recognition beyond standard PSI-BLAST and was first introduced by Rychlewski et al. (Rychlewski et al., 2000
). The negative logarithms of the obtained e-values are used as sequence similarity scores.
Secondary structure similarity
As secondary structure similarity measure we used the global alignment of secondary structure elements. Our algorithm is a slightly simplified version of the secondary structure alignment procedure described by McGuffin et al. (McGuffin et al., 2001) and Przytycka et al. (Przytycka et al., 1999
).
In short, each secondary structure element is represented by a letter for the three states helix, strand or coil and a number corresponding to the length of the region. Unlike the method described by Przytycka et al. (Przytycka et al., 1999), we do not split secondary structural elements in order to obtain better matches.
As secondary structure prediction we use either the consensus procedure of Albrecht et al. (Albrecht et al., 2003) or the PSIPRED predictions (Jones, 1999b
) for both the target and template proteins. For the dataset provided by McGuffin and Jones (McGuffin and Jones, 2002
), we use the PSIPRED predictions as given in their downloadable data.
Data sets
As a test set for evaluating our fold recognition method, we used the protein data set of 542 non-redundant domains developed by McGuffin and Jones (McGuffin and Jones, 2002). They describe a subset of 252 known folds, which had at least one other match in the data set, and 290 domains with folds unique with respect to the data set. From the set of unique folds we had to take out the templates with the CATH names 1GCB01, 1GCB02, 1BPLA1, 2CGPC1 because their domain definition has changed or was not part of CATH version 2.4. As targets for the fold recognition we used the set of known folds and as the template library we used the complete set of 538 domains.
In addition, we tested our method on the protein test and training set provided by Ding and Dubchak (Ding and Dubchak, 2001). This dataset is based on SCOP (Murzin et al., 1995
) and consists of 386 SCOP domains for the testing set and 285 SCOP domains for the training set. We compile a structurefunction database, which annotates the enzyme codes of the SCOP (version 1.53) domains from the SwissProt and the Enzyme database. We leave out all homologous proteins from that database which result in a Blast e-value of <0.001 with any protein of the protein test set defined by Ding and Dubchak.
Ranking method
A non-linear ranking scheme, which is in effect a simple two-layer neural network, is used.
For each feature i the score xi is weighted with a non-linear function gi(xi):
gi(xi) = ai{tanh[ci(xi bi)] + 1}/2
The terms a, b and c correspond to a weight, an offset and a rescaling factor of the feature score, respectively. For a feature score xi << bi, the term gi(xi) is close to zero; for xi >> bi the term gi(xi) approximates the weight ai.
The total score of a template is the sum of the weighted scores of the features used:
S = igi(xi)
Evaluation
In order to evaluate our fold recognition method, we generated 20 random partitions of the McGuffin and Jones dataset of known folds into 80% training and 20% test set. For each partition, we ran a 200-step Monte Carlo optimization of the three weights ai for the three features. The parameters bi and ci are set to fixed values. For each partition and training run, the optimized weights are used to rank the proteins of the test set. If a feature is not used, its weight is set to zero during both training and testing. For the Ding and Dubchak data set, the test and training sets are fixed. Here for each feature combination we ran 20 different 200-step Monte Carlo optimizations.
![]() |
Results and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
|
PDB-BLAST is more successful for the dataset provided by Ding and Dubchak, with about 59.1% correct first hits, because the test and training set contain some distantly homologous protein pairs. As can be seen in Table II, the highest prediction accuracy is achieved if all three similarity features are used (74.9%). An improvement in accuracy over the second best combination (secondary structure plus sequence) of nearly 1.8% is observed. These values can be directly compared with the prediction accuracy reported by Ding and Dubchak (Ding and Dubchak, 2001) for this test set. They obtained a prediction accuracy of 56% for the test and training set underlying the statistics of Table II.
A novel aspect of the present method is the use of functional information in the form of enzyme codes. The effect of this term is to reduce the number of false positives by sieving out improbable folds. In Table III we show some examples of protein targets for which a correct template is predicted with the highest score only if the functional similarity scoring term is used. The examples are taken from the McGuffin and Jones dataset. In all cases shown in Table III, the PSI-Blast e-values for the templates with the target as a query sequence are greater than the cutoff value and do not influence the ranking. The examples were chosen to highlight three different situations. The first example shows a case were a completely false prediction (wrong class) is replaced by a correct one with a perfect (four-digit) enzyme code match. In the second example, discrimination between enzymes versus non-enzymes allows one to assign the correct topology. The third example in this table is especially interesting: The target protein (CATH code 1CHD00, a carboxyl methylesterase with enzyme code 3.1.1.61) has only three digits of the enzyme code in common with the top-scoring template protein (CATH code 1WAB00, an acetylhydrolase with enzyme code 3.1.1.47). However, because of the functional similarity term, a correct template structure is predicted (three CATH levels), even though it has a slightly lower secondary structure similarity score compared with the template 1LUCA0.
|
The method, which uses a combination of secondary structure similarity and sequence similarity, is well suited for fold assignment on a genome-wide scale. Provided that the secondary structure prediction and the PDB-BLAST result of the target sequence are given, the computing time for predicting a fold from sequence using a template library of 4030 SCOP domains typically takes less than 1 min on a Pentium IV 2.0 GHz computer.
We show that by using knowledge about the function, in addition to the predicted secondary structure of a protein, it is possible to increase the fold recognition prediction accuracy beyond the values achieved by sequence alignment-based tools such as PSI-BLAST or GenTHREADER. We conclude that the secondary structure and the enzyme class of a protein contain additional information with respect to sequence similarity and this should be used extensively by fold recognition systems.
![]() |
Acknowledgements |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J.H., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 33893402.
Bairoch,A. and Apweiler,R. (2000) Nucleic Acids Res., 28, 4548.
Bartlett,G.J., Todd,A.E. and Thornton,J.M. (2003) In Bourne,P.E. and Weissig,H. (eds), Structural Bioinformatics. Wiley-Liss, Hoboken, NJ, pp. 387410.
Bjunicki,J.M., Elofsson,A., Fischer,D. and Rychlewski,L. (2001) Proteins, Suppl 5, 184191.
Bowie,J.U., Lüthy,R. and Eisenberg,D. (1991) Science, 253, 164170.[ISI][Medline]
Bryant,S.H. and Lawrence,C.E. (1993) Proteins, 16, 92112.[ISI][Medline]
Ding,C. and Dubchak,I. (2001) Bioinformatics, 17, 349358.[Abstract]
Fischer,D. (2000) In Altman,R.B., Dunker,A.K., Hunter,L., Lauderdale,K. and Klein,T.E. (eds), Proceedings of the Pacific Symposium on Biocomputing. World Scientific, Hawaii, pp. 119130.
Godzik,A., Kolinski,A. and Skolnick,J. (1992) J. Mol. Biol., 227, 227238.[ISI][Medline]
Hegyi,H. and Gerstein,M. (1999) J. Mol. Biol., 288, 147164.[CrossRef][ISI][Medline]
Jones,D.T. (1999a) J. Mol. Biol., 287, 797815.[CrossRef][ISI][Medline]
Jones,D.T. (1999b) J. Mol. Biol., 292, 195202.[CrossRef][ISI][Medline]
Jones,D.T. (2000) Curr. Opin. Struct. Biol., 10, 371379.[CrossRef][ISI][Medline]
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 8689.[CrossRef][ISI][Medline]
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 25772637.[ISI][Medline]
Karplus,K., Barrett,C. and Hughey,R. (1998) Bioinformatics, 14, 846856.[Abstract]
Kelley,L.A., McCallum,C.M. and Sternberg,M.J.E. (2000) J. Mol. Biol., 299, 501522.[CrossRef]
McGuffin,L.J. and Jones,D.T. (2002) Proteins, 48, 4452.[CrossRef][ISI][Medline]
McGuffin,L.J., Bryson,K. and Jones,D.T. (2001) Bioinformatics, 17, 6372.[Abstract]
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995). J. Mol. Biol., 247, 536540.[CrossRef][ISI][Medline]
NC-IUBMB (1992) Nomenclature Committee of the Union of Biochemistry and Molecular Biology. Enzyme Nomenclature. Academic Press, New York.
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 10931108.[ISI][Medline]
Przytycka,T., Aurora,R. and Rose,G. (1999) Nat. Struct. Biol., 6, 672682.[CrossRef][ISI][Medline]
Rost,B. (1999) Protein Eng., 12, 8594.
Russel,R.B., Copley,R.R. and Barton,G.J. (1996) J. Mol. Biol., 259, 349365.[CrossRef][ISI][Medline]
Rychlewski,L., Jaroszewski,L., Li,W. and Godzik,A. (2000) Protein Sci., 9, 232241.[Abstract]
Sippl,M.J. and Weitckus,S. (1992) Proteins, 13, 258271.[ISI][Medline]
Sternberg,M.J.E., Bates,P.A., Kelley,L.A. and MacCallum,R.M. (1999) Curr. Opin. Struct. Biol., 9, 368373.[CrossRef][ISI][Medline]
Thiele,R., Zimmer,R.M. and Lengauer,T. (1999) J. Mol. Biol., 290, 757779.[CrossRef][ISI][Medline]
Received February 19, 2003; revised August 21, 2003; accepted September 12, 2003.