Protein threading by PROSPECT: a prediction experiment in CASP3

Ying Xu1,2, Dong Xu1, Oakley H. Crawford1, J.ralph Einstein1, Frank Larimer, Ed Uberbacher, Michael A. Unseren1,3 and Ge Zhang1

1 Computational Protein Structure Group, Computational Biosciences Section, Life Sciences Division and 3 Center for Engineering Science Advanced Research, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830-6480, USA


    Abstract
 Top
 Abstract
 Introduction
 Results
 Discussion
 References
 
We present an analysis of the protein fold recognition experiment using PROSPECT in The Third Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP3). PROSPECT is a computer program we have recently developed for finding an optimal alignment between a protein sequence and a protein structural fold. Two unique features of PROSPECT are (a) that it guarantees to find the globally optimal sequence–structure alignment and does so in an efficient manner, when the alignment-scoring function consists of three additive terms: (i) a singleton fitness term, (ii) a pairwise contact preference term between residues that are spatially close (<=15 Å between their ß-carbons) and (iii) an alignment gap penalty; and (b) that it guarantees to find the globally-optimal alignment under various constraints on the unknown protein specified by the user. In the CASP3 experiment, PROSPECT correctly identified the most similar folds for 11 targets and predicted closely-similar folds for five other targets among the 23 targets which can be classified into the category of fold-recognition problems and also had their experimentally-determined structures available. Among the 11 correctly identified folds, PROSPECT obtained good sequence–structure alignments for nine of them. On three of the five ab initio prediction problems, PROSPECT successfully located partial structures from our template library, which align accurately with the corresponding targets.

Keywords: combinatorial optimization/fold recognition/protein threading/protein structure prediction


    Introduction
 Top
 Abstract
 Introduction
 Results
 Discussion
 References
 
CASP3 is the third community-wide protein structure prediction experiment in its series, which started five years ago (CASP, 1995Go). CASP3 consisted of three prediction categories: (i) homology-based structure prediction; (ii) fold recognition-based structure prediction and (iii) ab initio structure prediction. Forty-three prediction targets were provided by structural experimentalists for predictions, of which 36 structures were solved but not published at the time of prediction. We participated for the first time in the CASP experiments. Though our main focus was on the category of fold recognition, we made predictions on all 43 targets. The purpose of this paper is to analyze and assess our performance at CASP3 mainly in the category of fold recognition.

It is generally believed that both singleton fitness and pairwise contact preference between spatially nearby residues, along with alignment gap penalties, need to be considered in scoring the quality of a sequence–structure alignment for the purpose of protein fold recognition. But one problem caused by such a scoring scheme is the lack of effective computational methods which can both guarantee to find an optimal sequence–structure alignment, and do so in an efficient manner (within a practically acceptable amount of time). (Throughout this paper, an optimal alignment means a globally-optimal alignment.) To avoid the computational difficulty, most existing threading programs employ heuristic or, in some cases, stochastic methods for finding an optimal sequence–structure alignment. With this type of threading method, it is difficult to determine if the scoring function is inadequate or the optimizer used did not find the function's global optimum, when the prediction performance is not good. To address this problem, we have recently developed an algorithm which guarantees to find an optimal sequence–structure alignment and does so efficiently when pairwise contacts are considered only between nearby residues, say up to 15 Å between their ß-carbon atoms (Xu et al., 1998bGo; Xu,Y. and Xu,D., manuscript submitted). We have implemented this algorithm as a computer program, called PROSPECT (PROtein Structure Prediction and Evaluation Computer Toolkit) (Xu,Y. and Xu,D., manuscript submitted). CASP3 provided an excellent opportunity for testing and evaluating the PROSPECT program.

During the CASP3 experiment (CASP, 1998Go), we noticed that a number of targets had a limited amount of structural information, e.g. disulfide bonds or active sites, available in the literature before their structures were fully solved. For example, target t0072 was known to have three disulfide bonds between certain pairs of cysteines. (Unfortunately, the experimentally-determined structure of t0072 was not made available for the assessment of predictions in the CASP3 experiment. Hence no analysis will be done on t0072 in this paper.) This type of information should help improve the fold recognition and the sequence–structure alignment results when applied to the protein threading process. Because of observations of this kind, we have generalized the PROSPECT threading algorithm to deal with a more general class of threading problems, which we term constrained optimal threading problems, where the goal is to find an optimal sequence–structure alignment under various (mainly geometrical) constraints on certain residues of the unknown structure. Our experience has shown that even a small number of constraints could improve both the prediction accuracy and the computational efficiency significantly.

Materials and methods

The prediction procedure in our CASP3 exercise consists of three main steps for each target: (i) information collection; (ii) information-based (or information-constrained) protein threading using PROSPECT and (iii) post-processing of fold recognition results. Using this procedure, we made structure predictions on all the targets in CASP3. The following section summarizes the prediction targets in CASP3.

Prediction targets in CASP3

There were 43 targets (t0043–t0085) for prediction in CASP3. Among them, 35 have their experimental structures available so far. (Target t0054 was used in the CASP3 assessment, but it is not available to the predictors. Hence we do not include it in our discussion.) A detailed description of the targets can be found at the CASP3 official web page (http://PredictionCenter.llnl.gov/casp3/ ). The 35 targets can be roughly grouped into six categories.

The analysis of our performance in CASP3 prediction experiment described in this paper focuses on the 23 targets belonging to the family, superfamily and fold categories. These are the categories of targets where threading is most useful for fold recognition. We include the family category in our analysis mainly to show how PROSPECT improves the accuracy of a sequence–structure alignment over a typical sequence comparison program. Further, we give a brief summary of how PROSPECT performed in the ab initio category.

Pre-processing: information collection

The goal of the information collection step was to extract as much information about each target from the literature or from computational results using publicly available tools as we could within the time limit (roughly 1 day per target). This step included (a) a literature search for structural or functional information about each target; (b) a database search against PDB (Bernstein et al., 1977Go) using BLAST (Altschul et al., 1997Go) for possible homologs; (c) secondary structure prediction using PHD (Rost and Sander, 1993Go) and PSA (Stultz et al., 1993Go, White et al., 1994Go); (d) prediction of motifs or potential active sites by MOTIF (at http://www.motif.genome.ad.jp); (e) multiple sequence alignment against the SWISS-PROT database (Bairoch and Apweiler, 1999Go) to determine the potentially conserved positions of each target. The information collected was used either as constraints during the threading process or to narrow down the candidate list for fold recognition.

Protein threading by PROSPECT

Each target was threaded against a subset of FSSP (Holm and Sander, 1996Go) (release of April 1998), unless a smaller set of remote homologs was identified. (A template fold in PROSPECT's template library needs to be pre-processed, which could take a significant amount of time. The development of the PROSPECT system was not completed until a few weeks into the CASP3 prediction season; hence some of the template folds are not pre-processed and put into our template library in time.) PROSPECT found an optimal alignment between the target and each template in the template library. An `optimal' alignment is determined in two steps: (i) finding an optimal alignment between the target and core elements of the template, while penalizing length differences between the corresponding loop regions, using a divide-and-conquer algorithm (Xu et al., 1998aGo; Xu,Y. and Xu,D., manuscript submitted) [cores consist of {alpha}-helices and ß-strands, as determined by DSSP (Kabsch and Sander, 1983Go)]; (ii) alignments between loop regions are done separately after core elements are aligned, using a sequence–sequence alignment algorithm like Smith–Waterman (Smith and Waterman, 1981Go).

The quality of an alignment between a target and a template's core elements is measured by a linear combination of (a) Es, a singleton fitness term which measures how well a residue-type fits a particular environment, defined by the original residue type in this template position, its secondary structure and solvent accessibility; (b) Ep, a pairwise contact term which measures how preferable to have a pair of particular residue types in close contact (up to 15 Å between ß-carbons) and (c) El, a term which penalizes the length differences of corresponding loops. More specifically, the following scoring function is used to measure an alignment ({S2, T1}, {S4, T2}, . . ., {S2M, TM}; odd-numbered Sis are aligned with loops) between a sequence s and a series of cores T1, . . ., TM:

where {S1, S2, . . ., S2M+1} is a partition of s (each Si is a substring of s) with ||S2i|| = ||Ti||, for all i 2 [1, M]; X[k] represents the kth element of X; loop(Ti, Ti+1) denotes the loop length between core Ti and Ti+1 and || · || means the size of a set.

The divide-and-conquer algorithm (Xu et al., 1998aGo; Xu,Y. and Xu,D., manuscript submitted) of PROSPECT finds an alignment which achieves the global optimum of the scoring function (1). A significant and unique property of this algorithm is that it finds the globally-optimal alignment efficiently. The global optimality is achieved by implicitly searching through the whole alignment space, and its computational efficiency is obtained by avoiding explicit examination of a significant portion of the search space, mathematically proven not to contain the global optimum. Table IGo summarizes the amount of time used to find an optimal alignment between each target and its best scoring template structure for all the targets except t0075, which PROSPECT incorrectly identified as a new fold. (We did not put the total threading time against the whole template set because the size of this template library varied from target to target.)


View this table:
[in this window]
[in a new window]
 
Table I. A summary of computing time of threading by PROSPECT
 
To deal with situations where corresponding cores of two aligned structures may have different lengths, the current version of PROSPECT allows deletions/additions of up to two residues at each end of a core in the template. The number of deleted/added residues from/to a core is determined by the optimization algorithm of PROSPECT, i.e. the one that gives the globally optimal score is applied. More residues could be added or deleted in a core alignment, but that would increase the computational time significantly.

Information-constrained fold recognition

PROSPECT allows a user to incorporate structure-related information about a specific target during the threading process. This is done by modifying the objective function (1) based on the user's input, and by generalizing the divide-and-conquer algorithm. Four types of constraints are currently allowed: (a) distance constraints between certain pairs of residues in the target sequence; (b) consistency with known or predicted secondary structure information; (c) consistency with given partial alignments; (d) the maximum number of core elements allowed to be removed from the template structure in a threading alignment. Constraints have been used in CASP3 to improve both the fold recognition accuracy and the computational efficiency of threading by reducing the alignment space to be searched. A detailed description of how the generalized algorithm deals with these constraints will be given elsewhere.

Pairwise distance constraints

Information about distances between certain pairs of residues (e.g. between their ß-carbons) in the target may be known before its full structure is determined, e.g. disulfide bonds or the residues in an active site. A distance constraint specified as `distance(si, si') [ai, bi]' is enforced in PROSPECT by adding to the objective function (1) a term which penalizes deviation of distance (si, si') between residues si and si' from the specified range [ai, bi]. When the penalty is set so high that no distance (si, si') [ai, bi] is violated, the distance constraint is a hard constraint; otherwise it becomes a soft constraint. During the search for an optimal alignment, partial alignments with significant violations of the distance constraints are discarded for further consideration. This reduces the size of the alignment search space and hence improves the computational efficiency.

Secondary structure information

It is estimated that the best secondary structure prediction programs identify the secondary structure (helix, strand or loop) for residues of the target sequence with approximately 70% accuracy. Multiple sequence alignments with related proteins could also reveal secondary structure information about particular residues. Secondary structures of small proteins could also be determined by NMR before the full structures are solved (e.g. target t0056 was provided with the NMR-determined secondary structures, six {alpha}-helices). Effectively using this type of information should help improve both the fold recognition accuracy and computational efficiency. For each residue with specified secondary structure information, PROSPECT adds a term to the objective function (1), which penalizes the difference between the predicted or known secondary structure type and the type of its aligned core secondary structure. A variable weight is used for such a penalty term, based on the confidence level of the secondary structure information. During the CASP3 experiment, one heuristic we used was that a residue in the center of a run of residues, predicted to be a loop with high confidence values (score = 9) by PHD (Rost and Sander, 1993Go), can be aligned only with a loop residue in the template. This heuristic did not cause any alignment accuracy problem (on the CASP3 targets), but significantly reduced the computational time. We have applied similar ideas to threading-alignment predictions by other threading programs (Jones et al., 1992Go; Crawford, 1999Go).

Partial alignments

Certain residues or runs of residues (short fragments) in the target sequence may be conserved among the family members. This information could be obtained through multiple sequence alignments or by literature search (for active sites, for example). A correct alignment between the target sequence and its native-like template structure should be consistent with those (preferred) alignment fragments. A term is added to the objective function (1), which rewards the realization of each such partial alignment during the threading process. A variable weight is assigned to each such rewarding term to reflect the confidence level or the significance of a predicted conserved element. Known partial alignments are also used to reduce the computational time as do other constraints mentioned above.

Unaligned cores

It is well known that similar structural folds may not be fully aligned with the target due to the difference in the structures. Certain cores of the template or segments of the target sequence may have to be removed from the alignment. This is particularly true in cases where the most similar fold in the template library is not from the same protein family as the target sequence. While PROSPECT allows segments of the target, at any position of the sequence, to be unaligned and treats them as loop regions, it allows deletions of cores only at the ends of the template but not in the middle. A user-specified parameter, D, is used to control the maximum number of cores that can be deleted at each end of a template. (D can be automatically set based on the length difference between a query sequence and the template.) Deletion of cores at the ends of a template is achieved by generalizing the threading problem defined in Eqn 1 to the following problem. The goal is to find k1 and k2, with k1, k2 [1, D], such that alignment ({S2k1, Tk1}, . . . {S2(Mk2), TM–k2}) optimizes the following function:

where Tk1, M–k2 is the partial template from core k1 to core Mk2.

This generalized threading problem can be solved using a slight generalization of the divide-and-conquer algorithm (Xu et al., 1998aGo; Xu,Y. and Xu,D., manuscript submitted). The algorithm compares all the threading alignments with no cores up to D cores removed from each end of the template, and selects the best scoring one. A careful implementation of the algorithm increases the computational time insignificantly, not by D2 times as a naive implementation would do. This generalized approach has provided the flexibility of allowing alignment between a target sequence and a partial template. Deletions of cores in the middle of a template can theoretically be done similarly but this would greatly increase the computational time, due to the overall structure of our divide-and-conquer algorithm. Research is currently under way to develop more efficient ways to handle deletions of cores in the middle of a template.

PROSPECT provides a standard format, which allows a user easily to specify the constraints for a target before the threading process starts; then it automatically modifies the objective function according to the constraints and solves the constrained threading problem.

Post-processing and submission

For each target of CASP3, the fifty best-scoring sequence–structure alignments were selected for post-processing. The post-processing step consisted mainly of (a) checking for compactness (b) generating atomic models for the top 10 alignments which passed the compactness test and (c) ranking the atomic models and their corresponding alignments. Then the top five sequence–structure alignments (called model-1, . . ., model-5) and the top one or two atomic models were selected and submitted as the final predictions, depending on whether their scores were good enough.

(a) Since PROSPECT allows deletions of cores at the ends of a template, the aligned substructure may not always be compact, as measured by its radius of gyration. (Research is currently ongoing to use substructure compactness as a constraint when deleting cores.) Uncompact aligned substructures were removed from further consideration.

(b) All the atomic structures were generated by MODELLER (Sali and Blundell, 1993Go). Multiple templates were fed to MODELLER as the starting structures if several high-quality template structures (identified as folds similar to the target) were available. Ten structures were generated for each alignment.

(c) We then used WHATIF (Vriend, 1990Go) and PROCHECK (Laskowski et al., 1993Go) to assess (i) the packing and backbone conformations of each structure, (ii) the consistency between the secondary structures predicted by PHD (Rost and Sander, 1993Go) and those of the predicted model assigned by DSSP, (iii) the inside/outside occupancies of hydrophobic and hydrophilic residues and (iv) the existence of D-amino residues or cis peptide bonds in the predicted model; and chose the best one to be the atomic model. In case none of the 10 structures had acceptable packing or backbone conformations, we adjusted the alignment and rebuilt the model. We also visually checked the structures and tuned the alignment if necessary.

Post-prediction assessment

To evaluate the prediction accuracy, we have conducted both sequence-independent superposition and sequence-dependent superposition between the actual structure and the predicted structure. Structure–structure alignments in a sequence independent superposition were performed by SARF (Alexandrov, 1996Go; Alexandrov and Fischer, 1996Go). Though the number of structurally alignable residues calculated by SARF is often different from the one calculated by ProSup (Feng and Sippl, 1996Go), the difference between the two is generally insignificant. The sequence-dependent superpositions, examples of which are shown in Figures 1 and 6GoGo, were aligned by the protein least squares fitting program ProFit (A.C.R.Martin, University College London). The schematic diagrams of the proteins were made by the Insight II package (Molecular Simulations Inc., 1995), and the C{alpha} traces were made by VMD (Humphrey et al., 1996Go).



View larger version (43K):
[in this window]
[in a new window]
 
Fig. 1. A comparison between the predicted structure (thick lines) and the experimental one (thin lines) for target t0068 (left) and target t0074 (right).

 


View larger version (15K):
[in this window]
[in a new window]
 
Fig. 6. A comparison between the predicted structures (thick lines) and the experimental ones (thin lines) for targets t0065 (left), t0073 (middle) and t0084 (right).

 

    Results
 Top
 Abstract
 Introduction
 Results
 Discussion
 References
 
A summary of the performance of our CASP3 predictions is shown in Table IIGo. The overall quality of predictions can be grouped into the following categories: (i) high quality, the model and the experimental structure has a low C{alpha}-r.m.s.d. in the sequence-dependent superposition; (ii) good alignment, the model has small errors in the alignment to a correct fold; (iii) recognized, the model identifies the correct fold, but the alignment is not good; (iv) motif, a motif or a supersecondary structure is predicted correctly with a small C{alpha}-r.m.s.d. to the corresponding portion of the experimental structure in the sequence-dependent superposition; (v) closely-similar fold, the model shares similar features in the overall architecture with the experimental structure; (vi) unrecognized, the model has a significantly different fold from the experimental structure. We will analyze our performance according to these categories for each of the three classifications of targets, namely, family, superfamily and fold, which were defined in the Introduction.


View this table:
[in this window]
[in a new window]
 
Table II. CASP3 prediction assessment
 
Family

All the structures of predicted model-1 in this category have good alignment to the correct fold. Figure 1Go shows two examples of the quality of the models in the sequence-dependent superposition. For target t0068, the core structure of the fold is highly conserved between the target and the template (the C{alpha}-R.m.s.d. between the model and the experimental structure is only 1.76 Å for 77% of the structure in the sequence-independent superposition), while there are some conformational differences in the peripheral elements. For target t0074, the secondary structures and the alignments are predicted correctly (no shift error of alignment in the core structure of the fold), while there are some shifts in the relative positions of secondary structures between the template and the experimental structure. Our threading alignment of each model is much better (see Table IIIGo) than the alignment produced by the sequence–sequence pairwise alignment program ALIGN (http://www2.igh.cnrs.fr/bin/align-guess.cgi). However, the quality of our alignments is similar to PSI-BLAST (Altschul et al., 1997Go) results (the sequence alignments using multiple sequence profile). Unfortunately, we failed to use PSI-BLAST for our CASP3 predictions.


View this table:
[in this window]
[in a new window]
 
Table III. Comparison between alignments of PROSPECT and ALIGN
 
Superfamily

In the superfamily category (eight targets), PROSPECT recognized three correct folds, found two closely-similar folds and missed three targets. Prediction for t0053 constitutes a successful example of applying constraints during the threading process. After the first-pass threading of t0053 against the whole template library, we found that the top scoring template (1ak1) has an active site at His183 (Al-Karadaghi et al., 1997Go). Using the BLOCK search (Henikoff and Henikoff, 1994Go), we identified His145 of t0053 as the most likely active site. We then ran PROSPECT to thread t0053 against 1ak1 under the constraint that t0053's His145 has a high probability to be aligned with His183 of 1ak1, and made the final sequence–structure alignment prediction.

It turned out that 1ak1 is actually the fold of the experimental structure. t0053 and 1ak1 have only 11.2% identity in the sequence-independent structure–structure alignment. In a more detailed analysis of the unconstrained alignment, only one of the 13 alignable cores of t0053 was aligned correctly with 1ak1, and one more was aligned with a four-residue shift. Our final prediction with the constraint correctly aligned five cores, and aligned each of the other eight cores with at most a five-residue shift. The C{alpha}-r.m.s.d. between the predicted three-dimensional structure and the experimental one is 7.7 Å for all the experimentally-solved residues (257 in total); and 174 residues (66% of the target) have a C{alpha}-r.m.s.d. of 4.0 Å (Koehl and Levitt, 1999Go) (see Figure 2Go).



View larger version (61K):
[in this window]
[in a new window]
 
Fig. 2. A comparison between the predicted structure (left) and the experimental one (right) for target t0053. The cylinders indicate {alpha}-helices, the strands indicate ß-sheets, the dark lines indicate turns, and the thin lines indicate loops.

 
We identified the correct fold for target t0085 (template 1fgj). There are 72 structurally alignable residues (only about one third of the target) between the target and the template, as found by SARF. Our model-1 had the highest number of equivalent residues in a sequence-independent superposition among all the models submitted by the CASP3 teams. Furthermore, several motifs were predicted correctly, e.g. the helix–turn–helix motif (residues 6–34) as shown in Figure 3Go, and residues 38–62 in model-1 (structurally alignable with residues 37–64 in the experimental structure).



View larger version (37K):
[in this window]
[in a new window]
 
Fig. 3. A comparison between the model-1 (left) and the experimental structure (right) for target t0085. The highlighted segments in protein structures show the residues 6–35 of the model-1 and residues 6–34 of the experimental structure. The structurally equivalent positions defined by SARF are shown by links of lines, where the upper and the lower sequences are for the model and the experimental structure, respectively, and the numbers indicate the sequence number of the protein.

 
We failed to place the correct fold as the top choice (model-1) for the other six targets in this category. However, PROSPECT recognized the correct fold (1psd-A) for target t0081 as model-4 with a poor alignment. It turned out that using three iterations of PSI-BLAST, one would easily identify the correct fold with a good alignment, which we failed to do.

Models for t0063 and t0080 are closely-similar folds to the experimental structures. Both model -1 of t0063 and its experimental structure have ß-barrel folds, but the former has one domain while the latter has two domains. The overall packing between different segments in t0080 is similar, although some core secondary structures vary. We missed the correct fold for t0044 due to a program bug in PROSPECT. After the bug was fixed, we re-ran PROSPECT on t0044, which placed the correct fold as number 1 scoring template with a good alignment. Unfortunately, the bug was noticed and fixed days after the expiration date for t0044.

Fold

Among the eight targets in this category, PROSPECT recognized the correct fold with a good alignment for one, found closely-similar folds for four and missed three.

PROSPECT identified the correct fold (1tup-A) for target t0067 but as its second choice (model-2). (1tup-A has 101 structurally equivalent residues with t0067 with C{alpha}-r.m.s.d. of 3.3 Å, one of the best templates in PDB according to the DALI server (Holm and Sander, 1993Go).) Figure 4Go shows a comparison of the two structures. The core elements of t0067 consist of four antiparallel ß-strands (two ß-hairpin motifs). Its topology is one of the infrequent ones among the eight observed topologies of two sequentially adjacent hairpin motifs (Branden and Tooze, 1999Go). Our prediction had the correct topology between the four ß-strands, with a small alignment error. Among the 75 structurally alignable residues (calculated by ProSup) between model-2 and the experimental structure of t0067, 40 residues had alignment shifts of four positions or less. In addition, the two short ß-strands at each side of the four long ß-strands and a small ß-sheet (two ß-strands) at the lower-left of the structures are identified. This example demonstrates that PROSPECT is capable of identifying a fold whose peripheral elements differ significantly from the target. Unfortunately, we did not place the template 1tup-A as the top choice (model-1). The core elements of our model-1 (template 1pbo-A) are also antiparallel ß-strands but they form a ß-barrel.



View larger version (33K):
[in this window]
[in a new window]
 
Fig. 4. A comparison between our predicted structure (the left protein) and the experimental one (the right protein) for target t0067. The lower-left graph shows the topology of the four long ß-strands in both the structures from the N-terminal to the C-terminal.

 
PROSPECT found closely-similar folds for four targets (t0043, t0046, t0059 and t0061). In model-1 of target t0043, the overall architecture of the fold [ß-sheets surrounded by helices, a superfold called `doubly wound' (Orengo et al., 1994Go)] was identified correctly, as shown in Figure 5Go. Also, a motif (ß-sheet–helix–ß-sheet for residues 1–41) was predicted correctly in the model. However, the directions of some secondary structure elements and the topology between them were not predicted correctly. In model-1 of target t0046, the ß-barrel architecture was identified correctly, but the barrel of our model is shorter and fatter than the ones in the experimental structure. Both model-1 of target t0059 and its experimental structure have a ß-barrel architecture. The experimental structure of t0061 is a compact form of our predicted model-1 (an {alpha}-helical fold).



View larger version (46K):
[in this window]
[in a new window]
 
Fig. 5. A comparison between our predicted structure (left) and the experimental one (right) for target t0043.

 
Peptide conformations and new folds

With its capability of aligning a target with a partial structure, PROSPECT aligned the small targets t0065, t0073 and t0084 with substructures of the templates 1abv, 1cos-A and 1uby, respectively, though these targets are in the category of ab initio predictions. These alignments with the partial structures directly gave accurate structure predictions, with C{alpha}-r.m.s.d.s of 4.2, 2.1 and 0.8 Å, respectively, as shown in Figure 6Go. Our model-1 for target t0073 is among one of the better models submitted to CASP3 (CASP, 1998Go). However, we failed to recognize targets t0052 and t0056 as new folds.


    Discussion
 Top
 Abstract
 Introduction
 Results
 Discussion
 References
 
CASP3 provided an excellent opportunity for testing the program PROSPECT, which we have been developing in the past year. One of the goals of the PROSPECT project is to develop mathematically rigorous and practically useful algorithms for protein threading. One of the key findings of this research is that the dominating factor in computational difficulty of rigorously solving a protein threading problem is what we term the topological complexity of a template structure, instead of the length of the template or the target sequence (Xu et al., 1998aGo; Xu,Y. and Xu,D., manuscript submitted). We found that the topological complexity of a template grows very slowly as the length of the template and the maximum distance in defining a pairwise contact increase. Hence, even though the computational complexity of our threading algorithm (Xu et al., 1998aGo) is an exponential function of the topological complexity, its running time is practically acceptable as the vast majority of templates have very small topological complexities, e.g. less than or equal to six when pairwise contacts are defined between residues at most 15 Å apart (measured between their ß-carbons) (Xu,Y. and Xu,D., manuscript submitted).

Our first-time performance in CASP (CASP, 1998Go; Koehl and Levitt, 1999Go) testifies to the advantage of having a threading algorithm, like PROSPECT, which guarantees to find the globally optimal sequence–structure alignment. PROSPECT recognized four correct folds in the superfamily/fold categories, in which none of the targets had identifiable sequence homology to any protein in PDB. Among the four targets, two of them had good alignments with the correct folds. Even when PROSPECT failed to identify the best available fold, the selected template often shared motifs and/or overall architecture of the target structure. PROSPECT is good at identifying the best templates for short peptides, as shown by the success in all three peptide targets (two of which are synthetic). This demonstrates that PROSPECT can be a powerful tool for peptide design. In the family category, though the correct fold can be identified without using threading, the alignment generated by the local sequence–sequence alignment program BLAST often covers only a small portion of the target sequence. PROSPECT generally produced much better alignments than the global sequence–sequence alignment program ALIGN. Although PSI-BLAST can also give a good alignment for a large portion of the target sequence when multiple homologous sequences to the target are available, many protein sequences have no unambiguous homology (defined by BLAST) to other known protein sequences (so-called `orphan genes'), or have too few homologs to produce a reliable multiple sequence profile. In these cases, PROSPECT can be an ideal tool, particularly when threading sequences at the genome scale.

While some success was achieved by our threading method in CASP3, a number of lessons have been learned from where we failed. These lessons should provide guidance for development of more accurate and efficient computational methods for protein structure predictions.

The similarity between some targets and their templates is not very strong, and it is unclear whether the current energy function (using knowledge based singleton and contact potential) is capable of identifying the fold. For example, for target t0075, for which we predicted a new fold, the best match (1rlr) found by DALI can align only 59 residues with z-score 3.5 and r.m.s.d. 4.3 Å. The closest structure neighbor (1jvr) found by VAST (Gibrat et al., 1996Go) can align only 25 residues out of 110 to the experimental structure. (Depending on the definition of new fold, this target could be accepted as a new fold.) The template 1rlr was ranked by PROSPECT the 59th out of more than 1300 templates, indicating some preference for the fold, but insufficient for accepting it. The big variation in the core elements made the current energy function lose significant recognition power (Russell and Barton, 1994Go). A new type of energy function which is more sensitive to the fold than to the atomic coordinates may be needed.

Several new features have been added to the PROSPECT system since CASP3 (see http://compbio.ornl.gov/structure/prospect/ for details). These include (i) a better definition of template cores, which exclude core secondary structures or residues that do not significantly interact with other parts of the protein; (ii) an optional energy term using a position-dependent profile of the target sequence, derived from SAM (Karplus et al., 1998Go; Park et al., 1998Go); (iii) the DALI domain library in addition to the FSSP chain library as the threading templates. In addition, the system runs a PSI-BLAST search before the threading process, and adds any significant hits to the template library. These new additions have significantly improved the performance of PROSPECT in both the category of fold recognition and sequence–structure alignments on 700 pairs of aligned structures of FSSP (Holm and Sander, 1996Go). On the CASP3 targets of the fold recognition category, the new version of PROSPECT recognizes four more targets, t0044, t0079, t0081 and t0083, with good alignments while maintaining a similar performance level on other targets. In this exercise, a fold is considered to be correctly recognized if it ranks among the top five (just as in CASP3). Table 4Go summarizes the prediction improvement of PROSPECT on these four targets.


View this table:
[in this window]
[in a new window]
 
Table IV. Improved predictions by PROSPECT
 


    Acknowledgments
 
This research was sponsored by the Office of Health and Environmental Research, US Department of Energy, under Contract No. DE-AC05-96OR22464 with Lockheed Martin Energy Research Corporation. M.A. Unseren's research was also supported in part by the Engineering Research Program Office of Science, US Department of Energy, under Contract No. DEAC05-96OR22464 with Lockheed Martin Energy Research Corporation. We would like to thank our colleagues R.Mural, V.Olman, M.Shah, D.Hyatt and M.Galloway for their help and insightful comments during the CASP3 experiment.


    Notes
 
2 To whom correspondence should be addressed Back


    References
 Top
 Abstract
 Introduction
 Results
 Discussion
 References
 
Al-Karadaghi,S., Hansson,M., Nikonov,S., Jonsson,B. and Hederstedt,L. (1997) Structure, 5, 1501–1510.[ISI][Medline]

Alexandrov,N.N. (1996) Protein Engng, 9, 727–732.[Abstract]

Alexandrov,N.N. and Fischer,D. (1996) Proteins: Struct. Funct. Genet., 25, 354–365.[ISI][Medline]

Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.[Abstract/Free Full Text]

Bairoch,A. and Apweiler,R. (1999) Nucleic Acids Res., 27, 49–54.[Abstract/Free Full Text]

Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.[ISI][Medline]

Branden,C. and Tooze,J. (1999) Introduction to Protein Structure, 2nd edn. Garland Publishing, Inc., New York.

CASP (1995) Proteins: Struct. Funct. Genet., 23, 295–462.[ISI][Medline]

CASP (1998) Third Meeting on the Critical Assessment of Techniques for Protein Structure Prediction, December 13–17. Asilomar Conference Center, California.

Crawford,O.H. (1999) Bioinformatics, 15, 66–71.[Abstract/Free Full Text]

Feng,Z.K. and Sippl,M.J. (1996) Fold. Des., 1, 123–132.[ISI][Medline]

Gibrat,J.F., Madej,T. and Bryant,S.H. (1996) Curr. Opin. Struct. Biol., 6, 377–385.[ISI][Medline]

Henikoff,S. and Henikoff,J.G. (1994) Genomics, 19, 97–107.[ISI][Medline]

Holm,L. and Sander,C. (1993) J. Mol. Biol., 233, 123–138.[ISI][Medline]

Holm,L. and Sander,C. (1996) Science, 273, 595–602.[Abstract/Free Full Text]

Holm,L. and Sander,C. (1998) Proteins: Struct. Funct. Genet., 33, 88–96.[ISI][Medline]

Humphrey,W.F., Dalke,A. and Schulten,K. (1996) J. Mol. Graphics, 14, 33–38.[ISI][Medline]

Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 86–89.[ISI][Medline]

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.[ISI][Medline]

Karplus,K., Barrett,C. and Hughey,R. (1998) Bioinformatics, 14, 846–856.[Abstract]

Koehl,P. and Levitt,M. (1999) Nature Struct. Biol., 6, 108–111.[Medline]

Laskowski,R.A., MacArthur,M.W., Moss,D.S. and Thornton,J.M. (1993) J. Appl. Crystallogr., 26, 283–291.[ISI]

Molecular Simulations, Inc. (1995) Insight II (Release 95.0) San Diego, Molecular Simulations, Inc., CA.

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.[ISI][Medline]

Orengo,C.A., Jones,D.T. and Thornton,J.M. (1994) Nature, 372, 631–634.[ISI][Medline]

Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) J. Mol. Biol., 284, 1201–1210.[ISI][Medline]

Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584–599.[ISI][Medline]

Russell,R.B. and Barton,G.J. (1994) J. Mol. Biol., 244, 332–350.[ISI][Medline]

Sali,A. and Blundell,T.L. (1993) J. Mol. Biol., 234, 779–815.[ISI][Medline]

Schmidt,R.B., Gerstein,M. and Altman,R.B. (1997) Protein Sci., 6, 246–248.[Abstract/Free Full Text]

Smith,T.F. and Waterman,M.S. (1981) Adv. Appl. Math., 2, 482–489.

Stultz,C.M., White,J.V. and Smith,T. (1993) Protein Sci., 2, 305–314.[Abstract/Free Full Text]

Vriend,G. (1990) J. Mol. Graphics, 8, 52–56.[ISI][Medline]

White,J.V., Stultz,C.M. and Smith,T. (1994) Math. Biosci., 119, 35–75.[ISI][Medline]

Xu,Y., Xu,D. and Uberbacher,E.C. (1998a) In Istrail,S., Pevzner,P. and Waterman,M. (eds), The Second Annual International Conference on Computational Molecular Biology. ACM, New York, pp. 285–292.

Xu,Y., Xu,D. and Uberbacher,E.C. (1998b) J. Comp. Biol. 5, 597–614.

Received February 27, 1999; revised June 16, 1999; accepted August 5, 1999.