1Protein Informatics Group, Life Sciences Division and 2Computer Sciences and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6480, USA
3 To whom correspondence should be addressed at: Protein Informatics Group, 1060 Commerce Park Drive, MS 6480, ORNL, Oak Ridge, TN 37831-6480, USA. e-mail: xyn{at}ornl.gov
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: fold recognition/PROSPECT/protein structure prediction/threading/z-score
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Generally, existing fold recognition methods fall into two classes. The first class uses solely sequence information. The hidden Markov model (HMM) methods (Karplus et al., 1999) and PSI-BLAST (Altschul et al., 1997
) can be classified into this category. The second class uses structural information in addition to the sequence information, in various ways. In the profile method introduced by Bowie et al. (Bowie et al., 1991
), structural information, representing the local environment, is coded into each residue of a structural template; various dynamic programming schemes (global, semi-global, local and globallocal) (Waterman, 1995
) are used for finding the optimal sequenceprofile alignment. In the threading approach, structural information is used more explicitly through evaluating the compatibility between a query sequence and a structural template in terms of residueresidue contacts, hydrophobicity, etc. Threading methods generally require more complicated algorithms to deal with the residueresidue contact term. Previous studies have shown that each approach has its own strengths and weakness. For example, a threading-based method such as THREADER (Jones, 1999
) performs worse in homology recognition at the family and superfamily levels than a sequence-based method, while it achieves better performance at the fold level recognition (Lindahl and Elofsson, 2000
). Motivated by such observations, researchers have attempted to combine both approaches (Jones, 1999
; Kelley et al., 2000
; Panchenko et al., 2000
; Shi et al., 2001
), although finding an optimal way to fully take advantage of both approaches turned out to be difficult (Lindahl and Elofsson, 2000
).
The biggest disadvantage of threading-based methods is that they are computationally expensive when attempting to solve the sequencestructure alignment problem rigorously. It has been proven that the threading problem is NP-hard under some generalized assumptions (Lathrop, 1994). Hence, most of the existing threading methods employ heuristic approaches to avoid computational difficulty at the expense of performance accuracy. These methods include double dynamic programming (Jones et al., 1992
), frozen approximation (Godzik et al., 1992
) and the Monte Carlo sampling algorithm (Bryant, 1996
). By using a divide-and-conquer algorithm, PROSPECT (Xu et al., 1998
; Xu and Xu, 2000
) guarantees to find a globally optimal threading alignment, and it runs fast enough for individual protein structure predictions. However, its computational cost becomes prohibitively large for genome-scale applications.
A number of other threading programs have also been developed with reduced computational cost. For example, GenTHREADER (Jones, 1999) uses a classical sequence alignment algorithm to generate query-template alignments, and then evaluates the alignments by a threading potential; it provides a confidence measure for each predicted fold recognition using a neural network. The program 3D-PSSM (Kelley et al., 2000
) encodes the 1D and 3D profiles, based on multiple sequence alignments among proteins of the same superfamily, into each residue position of each template protein; it finds an optimal alignment between a query sequence or its sequence profile and each template structure by matching their sequence profiles. FUGUE (Shi et al., 2001
) represents one of the better performing threading programs currently available. One of its unique features is that it utilizes environment-specific amino acid substitution tables and structure-dependent gap penalties. Some other groups (Rychlewski et al., 2000
; Yona and Levitt, 2002
) have applied sequence profileprofile alignment algorithms rather than the traditional sequencesequence or sequenceprofile alignment algorithms.
We have recently developed a new threading algorithm which utilizes sequence, structure and evolutionary information and tries to combine all this information in an optimal way. The algorithm employs a new method for profileprofile alignment. One observation we have made is that though pairwise contact potentials help significantly in structural fold recognition, their contribution to threading alignment accuracy is fairly minimal. Considering the high computational complexity of a rigorous threading algorithm, we have adopted an efficient threading procedure. We first solve the threading problem, without considering the pairwise contact potential, using a dynamic programming approach. Then we do fold recognition, based on the full threading potential (including the pairwise potential) of the computed threading alignment without using pairwise potential. Our test results indicate that the performance of this new version of PROSPECT II is better or significantly better than some of the popular fold recognition methods, across all levels of structural similarity, ranging from families, to superfamilies to fold families as defined by SCOP (Murzin et al., 1995).
![]() |
Methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Energy functions
PROSPECT II has three knowledge-based energy terms: mutation, singleton and pairwise energies, plus a gap penalty function and a secondary structure match function. The overall threading score is measured by the following function:
Etotal = mEmutation +
sEsingleton +
pEpairwise +
gEgap +
ssEss(1)
In the previous version of PROSPECT, we used the PAM250 matrix (Gonnet et al., 1992) to measure a mutation energy, Emutation, as follows:
where (at, aq) is the aligned amino acid pair of the template and query sequences, and M*() is the PAM250 matrix. In PROSPECT II, the matrix M(at, aq) is replaced by the profileprofile alignment score, which will be described in the next section.
The singleton energy Esingleton measures the overall preference of each amino acid of the query sequence to its aligned structural environment in the template, defined in terms of the secondary structure (ss) and the solvent accessibility (sa) (Bowie et al., 1991). It is the sum of the singleton energy parameters, es(ai, ssi, sai), over all query amino acids aligned to the template sequence:
where the sum is taken over all the aligned amino acids of a query protein, and ssi and sai are the secondary structure and the solvent accessibility of the template amino acid aligned to the query amino acid type, ai, respectively. In PROSPECT II, the singleton energy is different from the previous version of PROSPECT in two aspects. First, we use a homolog-averaged singleton energy, rather than using simple summation of parameters as in Equation 3. The detailed description is given in the next section. Secondly, we employ the following function to derive the singleton energy parameters es(ai, ssi, sai) from the protein structure library:
where p(ai, ssi, sai) is the probability of finding an amino acid type ai at the structural environment characterized by the secondary structure type ssi and the solvent accessibility type sai, and p0(ai, ssi, sai) is a similar probability under the assumption that an amino acid ai has no structural preference. We used nine structural categories with three secondary structure states (helix, strand, loop) and three solvent accessibility states (buried, intermediate, exposed).
Following the widely used convention, H (-helix), G (310-helix) and I (
-helix) are classified as helices, and E (extended strand) and B (residue in isolated ß-bridge) states are classified as strand. All the other states are considered as loop. The boundary between different solvent accessibility levels was decided in such a way that the number of residues in the database is equally distributed within each level. The results are sai
7% for the buried state, 7% < sai
37% for the intermediate state and sai > 37% for the exposed state. It should be noted that the probability ratio p(ai, ssi, sai)/p(ai)p(ssi, sai) can be rewritten as p(ai|ssi, sai)/p(ai), where p(ai|ssi, sai) is the conditional probability of finding an amino acid type ai at a site with the secondary structure type ssi and the solvent accessibility type sai. In the previous PROSPECT (Xu and Xu, 2000
), p0(ai, ssi, sai) is given by p(ai)p(ssi)p(sai). This form assumes the statistical independence between the secondary structure and the solvent accessibility. Although the old form has no adverse effect on the alignment accuracy, one consequence is that it gives biased scores towards or against certain spatial arrangements of secondary structure units in the templates. As we can see in Figure 1, where the ssi- and sai-dependent expectation values of the singleton energy given by
p(ai)es(ai, ssi, sai) are plotted, a template with an exposed ß-sheet or a buried loop region will always have higher singleton energy regardless of a query sequence, while a template with a buried ß-sheet or an exposed loop will tend to be picked as a favorable template regardless of a query sequence. On the other hand, the new singleton energy terms have more or less constant expectation values. The probabilities, p(ai, ssi, sai), p(ai) and p(ssi, sai), were estimated based on data from a non-redundant set of known protein structures. We used the FSSP database (Holm and Sander, 1996
) for this purpose. Among 2689 protein structures in the database, we only used 2145 monomeric proteins by excluding the protein structures that were obtained by NMR experiments.
|
where the sum is taken over the pairs of residues, (i, j). In PROSPECT II, the pairwise energy parameters ep(ai, aj; r) are replaced by the homolog-averaged pairwise energy parameters, which are described in the next section. Only the pairs that are separated by at least three amino acids in the sequence are considered. The distance r between residues is measured by the distance between Cß atoms. Unlike the previous version of PROSPECT where a single cut-off distance rc = 7 Å is used, the current pairwise potentials are distance-dependent. The distance is divided into four intervals, 5, 7, 9 and 11 Å, and any pair with r > 11 Å is ignored. In Figure 2, several pairs of amino acids are plotted. As expected, as the distance increases, the pairwise potentials converge to zero regardless of amino acid type. The pairwise energy parameter, ep(i, j), is the log ratio of two probabilities:
|
where p(i, j; rc) is the probability of having a pair of amino acids, i and j, within the cut-off distance, rc, and p0(i, j; rc) is the similar probability but under the assumption that two amino acids, i and j, are mutually independent.
Given the protein structure database and the cut-off distance, deriving p(i, j; rc) is straightforward; by counting the number of amino acid pairs i and j within the cut-off distance, M(i, j; rc), and the total number of pairs, Mtotal, in the database. Then the probability is simply estimated by the ratio, M(i, j; rc)/Mtotal However, to estimate the background probability p0(i, j; rc), some special care needs to be taken. Let ni, nj and N be the number of amino acid types i, j and the total number of amino acids in the database, respectively. Then, the number of pairs of i and j is ninj if i j, ninj/2 if i = j, and the total number of pairs N(N 1)/2
N2/2; therefore, the probability p(i, j) = 2p(i)p(j) if i
j, p(i)p(j) if i = j where p(i) = ni/N. Assuming that the residues are uniformly distributed in a protein, p0(i, j; rc) is given by
where M(rc) is the total number of pairs within the cut-off distance. The same FSSP database that we used for the singleton energy parameter was also used for estimating the pairwise energy parameters.
It has been suggested (Lu and Skolnick, 2001) that the distance-dependent pairwise potential is more effective for fold recognition than the distance-independent one. However, it is unclear how much improvement we can get from the distance-dependent pairwise potential. In Figure 3, the pairwise energies estimated by the distance-dependent and distance-independent potentials with a cut-off distance of 7 Å are shown for the alignments between the query sequence 1cpca and FSSP templates. The correlation between the two data sets seems fairly high. This finding is consistent with the fact that the distance-independent pairwise potential with a 7 Å cut-off describes the pairwise interaction reasonably well (Xu and Xu, 2000
). Nonetheless, we use the distance-dependent pairwise potential because recent work on the accuracy of statistical potentials for fold assessment found that the performance of the distance-independent potentials depends on the size of the proteins (Melo et al., 2002
).
|
Utilizing evolutionary information
It is generally accepted that homology detection can be improved by utilizing multiple sequence alignment information (Gribskov et al., 1987; Henikoff and Henikoff, 1997
; Karplus et al., 1999
; Panchenko et al., 2000
; Rychlewski et al., 2000
; Yona and Levitt, 2002
), which provides information about structural and functional relationships within a group of related proteins, such as an evolutionarily conservative region or conserved hydrophobicity patterns. This information can be represented by a variety of formulations, such as motifs, profiles (Gribskov et al., 1987
), position-specific score matrices (PSSM) (Altschul et al., 1997
) and HMMs (Sjolander et al., 1996
). Regardless of the formulation, the essential idea is to use the common sequence features among proteins of the same (super)family, rather than a single protein, expressed in the form of position-dependent scores. If a certain amino acid is highly conserved at a particular position, that amino acid is assigned a highly favorable score. On the other hand, for a weakly conserved position, a value close to zero is assigned to all the amino acid types.
PSI-BLAST is the most popular program for generating such information (Altschul et al., 1997). It compares a query sequence against the protein database using the gapped BLAST program to construct a multiple sequence alignment, and then generates a PSSM. The PSSM is then used to identify additional similar sequences, and these are again used to update the PSSM. This process is iterated by a user-specified number of times or until it converges. Given the multiple sequence alignments for a protein of length L, PSI-BLAST constructs an Lx20 PSSM by taking the logarithm of the ratio between the estimated probability for residue j to be found at the position i(pij) and the probability with which it would be found by chance (pj, position-independent):
where 1 i
L is the residue position, and 1
j
20 the amino acid type. A collection of these probabilities in the form of an Lx20 matrix, here named a frequency matrix, is an efficient representation of the protein family to which a query protein belongs. The frequency matrix for a query protein and the PSSM for a template are created by running PSI-BLAST. We use these two matrices to implement the idea of profileprofile alignment and the threading potential energies averaged over homologs.
One way to utilize the evolutionary information is to calculate the mutation energy in our scoring scheme in the same way as in sequenceprofile alignment: a simple lookup of the matrix element in the PSSM corresponding to the amino acid at the aligned position in the sequence. It can be done either by constructing the PSSM for the templates and aligning them against a query sequence or vice versa. An alternative way, as demonstrated in another work (Rychlewski et al., 2000), is to compare two protein families by constructing the PSSMs for both templates and query proteins, and performing profileprofile alignment.
In this work, we define the score as a dot product between the profile vector (column vector of PSSM corresponding to the aligned position) of the template and the frequency vector (column vector of the frequency matrix corresponding to the aligned position) of a query protein:
where mij is the mutation score for the alignment between the position i of the template protein and the position j of a query protein, sik the PSSM of a template and pjk the frequency matrix of a query protein. It should be understood that Equation 10 is a natural extension of the score for the sequenceprofile alignment. The mutation score defined by Equation 10 is essentially the same formula described in a previous paper by Fischer (Fischer, 2000). We also have tried the method of Rychlewski and co-workers (Rychlewski et al., 2000
). We have found that Equation 10 shows slightly better performance. As stated earlier, the frequency vector contains the weighted frequencies of occurrence of 20 amino acids at a particular position among the protein family members. Therefore, mij in Equation 10 is the averaged score over 20 amino acids weighted by their probability of occurrence among the protein family members. In other words, the profileprofile alignment score is the mutation score averaged over homologs.
The same idea of average over homologs can be applied to the singleton and pairwise energies. Recent theoretical and computational studies (Finkelstein, 1998; Reva et al., 1999
; Cui and Wong, 2000
) suggest that the energy averaged over a set of homolog sequences can improve protein fold recognition. The reason for this improvement is that the potential energy parameters are inevitably noisy; the fact that the homologs share a common fold implies that for a correct fold, each homolog tends to have a similar (reasonably good) energy, and for an incorrect fold, the energies for homologs tend to be random. Therefore, the sum of the energies tends to be cancelled out for an incorrect fold, whereas they are constructively interfered for a correct fold. One advantage of using Equation 10 is that it removes the need to run many threadings arbitrarily for many homologous sequences and therefore is computationally efficient.
Moreover, it is superior in quality because the frequency matrix is constructed in such a way that it optimally represents the protein family by not only counting the actual frequencies in multiple sequence alignments but also considers the prior information on amino acid mutation propensities.
The singleton energy averaged over homologs is given by
where the summation index i includes all the aligned sequences of a query protein, and ssi and sai are the secondary structure and the solvent accessibility of the template sequence aligned to the query sequence position i, respectively. The calculated E value is the average across the homologs, as indicated by ha. If the position-dependent singleton energy parameters esi(ssi, sai) =
pijes(j, ssi, sai) are pre-calculated, no additional computational cost is introduced in this formulation. The expression for the pairwise energy averaged over homologs can be derived in a similar way:
The position-dependent pairwise energy parameters eijp =
pikpjlep(k, l) are the pairwise energies for the pair of query sequences (i, j) when they are aligned to the template sequences that are closer than the cut-off distance. These parameters can also be pre-calculated. As discussed in a later section, these homolog-averaged potential energies not only improve the fold recognition but also greatly increase the alignment accuracy.
Templates and threading algorithms
PROSPECT II uses, as the template library, both the protein chains from the FSSP database and the protein domains from the DALI domain library. Currently, the FSSP database is the default in PROSPECT II. Each template was derived from a chain (FSSP) or a domain segment (DALI) in a PDB file, and contains four pieces of information: (i) sequence; (ii) secondary structure types (helix, strand and loop) assigned by the DSSP package (Kabsch and Sander, 1983); (iii) solvent accessibilities (buried, intermediate and exposed states) determined from the percentage of exposed solvent-accessible surface area of a residues side chain according to the DSSP package; and (iv) Cß atom coordinates. The template file also contains the information on whether the residues are in the core region (
-helix and ß-strand) or in the loop region. It is generally believed that the core region is more conserved among the structures of the same fold than the loop region. Some previous work has utilized this property in the form of a structure-dependent gap penalty (Shi et al., 2001
) or by allowing no gaps within a core region (Xu and Xu, 2000
) The information on the core region is utilized in two ways. First, only the pairwise interactions between the core regions are taken into account. Secondly, in the divide-and-conquer algorithm, we assume that no gap can occur within a core region. Not all the
-helix and ß-strand regions are assigned to be a core region. They should meet the minimum length (currently four residues) requirement.
The threading method of PROSPECT II consists of two algorithms: the dynamic programming algorithm and the divide-and-conquer algorithm. When the pairwise energy is ignored, the dynamic programming algorithm is used to calculate an optimal alignment. We employed both the global and globallocal alignment schemes. The global alignment is the default for the FSSP chain library, and the globallocal alignment, where the start and end gaps for a query sequence are not penalized, is set as the default for the domain library. If the pairwise interaction terms are included, the divide-and-conquer algorithm is used (Xu and Xu, 2000). The algorithm solves the alignment problem by recursively solving a series of sub-alignment problems between sub-structures and sub-sequences, and then combining these sub-alignments in a consistent and optimal way (Xu et al., 1998
; Xu and Xu, 2000
).
z-Score, fold recognition and confidence index
The z-score is the threading score in standard deviation units relative to the average of the threading score distribution of random sequences with the same amino acid composition and sequence length as a query sequence (Bryant and Altschul, 1995). In practice, the average and the standard deviation are estimated by repeated threadings between a template and a large number of randomly shuffled query sequences. A combined z-score zc is given by
zc = z0 + wzpzp(13)
where z0 is the z-score for the threading energy without the pairwise energy, and zp the z-score for the pairwise interaction energy. A weighting factor wzp is chosen to maximize the fold recognition performance.
In order to objectively access the performance of PROSPECT II on fold recognition, we use the specificitysensitivity plot (Lindahl and Elofsson, 2000). The specificity is defined by
where TP(zc) and FP(zc) denote the numbers of true positives and false positives with a cut-off z-score zc, respectively. It measures the probability that a pair with a z-score greater than a certain cut-off z-score (hit) is a related protein pair at each similarity level. The sensitivity is defined by
where FN(zc) is the number of false negatives with a cut-off z-score zc. It is the fraction of the number of true positives identified relative to the total number of positives.
One of the most important requirements for the protein structure prediction method is the ability to tell the reliability of a prediction. It is even more critical for genome-scale applications. One obvious indicator is the z-score; the higher the z-score, the more reliable a prediction is. In this work, instead of using the z-score directly, the more informative confidence index is used. The confidence index is defined as the probability of a sequencetemplate pair with a certain z-score being a related protein pair. They are estimated by running a large number of threadings and by counting the number of true positives as a function of the z-score.
![]() |
Results |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Choosing an optimal set of weighting factors for the energy terms is crucial to the performance of PROSPECT II. Our search program for the optimal parameter values employed the orthogonal array method (Sun et al., 1999), as we did in our previous work (Xu and Xu, 2000
). The performance is measured from two aspects: alignment accuracy and fold recognition.
Based on the trials, we found that the weights were not very sensitive, i.e. if weight factors were in a reasonable range, the performance became more or less the same. The dominant term (weight) is the mutation energy. Other energy terms may affect 10% of performance. It is likely that the optimal parameters for the alignment accuracy are not necessarily optimal for the fold recognition. We have tried to find a set of parameters optimal for both. During this process, however, we have found that an increase in alignment accuracy tends to increase the fold recognition accuracy. Therefore, our strategy is that we optimize the parameters in Equation 1 to achieve the optimal alignment accuracy. The optimality for the fold recognition is achieved by adjusting the weighting factor in the formula for the combined z-score given by Equation 13. We used the same training set that had been used in our previous work (Xu and Xu, 2000
), consisting of 174 structurally aligned protein pairs that have <30% sequence identity between them. The alignment accuracy was measured by comparing the alignments with the structural alignments generated by SARF (Alexandrov, 1996
). Shifts from the exact alignments within four residues are counted as correct alignments. To check if the optimized parameters were over-trained for the training set, the alignment accuracy was calculated for the test set consisting of 137 protein pairs, the same set as used in our previous work. To further evaluate the alignment accuracy, the benchmark set prepared by Sippls group (prosup set) (Domingues et al., 2000
) was tested. In order to optimize the weighting factor wzp for the fold recognition, and to gather the statistics to derive the confidence index, we randomly chose 600 query proteins from the FSSP template library and ran threadings against 2857 templates. Each pair of 600x2856 = 1 713 600 query-template pairs excluding self-threadings was classified as sharing the family, superfamily or fold level similarity if two proteins share any domain with the same SCOP number at the family, superfamily or fold level. Otherwise, they were classified as unrelated proteins. The weighting factor wzp was chosen to minimize the number of erroneous cases where unrelated templates were ranked at the top position. To evaluate the fold recognition performance of PROSPECT II, we tested our method on benchmark sets created by Fischer (Lindahl set) (Fischer, 2000
). Finally, we measured the accuracy of predicted protein structures made by PROSPECT II, and compared the performance of PROSPECT II with that of other methods. We tested our method on 260 proteins in the common subset 3 of the EVA threading benchmarking server (Eyrich et al., 2001
).
Alignment accuracy
Before searching for the optimum combination of the weighting factors, we examined the accuracy of each energy term and its relative importance in determining the alignment accuracy. We calculated the optimal alignment accuracy for the training set by only using a single energy term (mutation, singleton or pairwise) and adjusting the gap penalty. We also calculated the same alignment accuracy for the test and prosup sets. The results are summarized in Table I.
|
The threading experiments with a single energy term suggest several points. First, we have learned that the profile information greatly increases the alignment accuracy, 10% increase in all cases. Secondly, the profileprofile alignment energy is the most important factor; therefore, the quality of the profile is crucial for the alignment accuracy. Thirdly, the pairwise energy term is least helpful. These findings are consistent with the previous finding (Xu and Xu, 2000
) that although the pairwise energy increases the correct fold recognition rate, its effect on the alignment accuracy is rather small.
The fact that the pairwise energy term is least helpful in determining alignment accuracy suggests an efficient threading strategy: the pairwise energy can be ignored during searching for the optimal alignment. After an optimal alignment is found, pairwise energy is evaluated based on the alignment and used for the fold recognition. We have searched for the optimal combination of weighting factors with and without the pairwise energy term systematically using a combination of exhaustive searching and the simulated annealing strategy. The best alignment accuracies we found for both cases are nearly identical at 80%. Based on this observation, we decided to adopt the conventional dynamic programming algorithm without the pairwise interaction to calculate the optimal alignment. After the optimal alignment is found, the pairwise interaction energy is evaluated using distance-dependent pairwise energy parameters. The results are shown in Table II. Similar levels of performance on both training set and test set indicate that the optimal parameters are not over-trained.
|
Fold recognition
The fold recognition performance of PROSPECT II, after we optimized the parameter wzp at 0.2, is shown in Table III. Compared with the performance only with z0, the combined z-score zc increases the success rate by 1% at the family level and 4% at the fold level (no change at the superfamily level). It is not surprising that the combined z-score is most helpful at the fold level similarity because the fold level is characterized by major structural similarity (Murzin et al., 1995
). As expected, the fold recognition performance varies with the level of similarity, as high as 85.0% at the family level and as low as 27.9% at the fold level.
|
In Figure 4, we plot z-score versus the specificity and the sensitivities at three levels of similarity. The specificity is the fraction of the related protein pairs among all the protein pairs with a higher z-score than a specified threshold z-score. It represents the confidence level of our prediction that two proteins with a z-score higher than a threshold z-score are related at least at the fold level. At threshold z-scores of 6, 8, 10 and 12, false-positive rates are <40, <10, <4 and <1%, respectively. In Figure 5, the number of pairs with different levels of similarity are plotted as a function of z-score. It can be seen that the majority of pairs with a z-score higher than 20 are in the same family. Between a z-score of 12 and 20, approximately the same number of pairs with the family and superfamily level similarities are found. Below a z-score of 12, a considerable number of false positives begin to appear and finally become larger than the number of true positives, at 8. From these results, the confidence index, defined by the ratio between the true positives and the number of identified pairs within a given z-score interval, category and a possible similarity level as a function of the z-score were decided and are shown in Table IV.
|
|
|
The Lindahl set consists of 976 protein sequences. The performance on all against all comparison of these 976 sequences is measured at three different similarity levels: family, superfamily and fold. The results are summarized in Table V. It can be seen that the performances of PROSPECT II at all similarity levels are significantly better than all other methods. At the family level, all methods except THREADER perform well, and the performance of PROSPECT II is slightly better than that of the second best-performing method, FUGUE. At the superfamily and fold levels, PROSPECT II performs >10% better than any other method. It is noteworthy that although the focus of PROSPECT II is on detecting the family/superfamily level similarity it also performs well in detecting the fold level similarity, better than THREADER. The unique feature of PROSPECT II is that it performs well across all similarity levels. This is in contrast to the fact that FUGUE performs best at the family/superfamily levels while THREADER performs best at the fold level. However, it should be pointed out that Table V may not show the accurate performances of the up-to-date versions of the methods because the data were directly taken from Shi et al. (Shi et al., 2001).
|
|
In this section, we describe the accuracy of predicted protein structures made by PROSPECT II. We compare the performance of PROSPECT II with those of other methods. The prediction models used in our comparison are simply the coordinates of backbone atoms of residues in template pdb files that are aligned to a query sequence. No further refinement was made. The prediction accuracy was measured by the CE z-score (Shindyalov and Bourne, 1998) between the models and the actual structures. In order to make an objective performance comparison between our method and the most up-to-date versions of other methods, we tested our method on the common subset 3 of the EVA threading benchmarking server (http://cubic.bioc.columbia.edu/eva/fr/common_lvl1n1.htmlcommon_set3). EVA (Eyrich et al., 2001
) is a web server for assessing protein structure prediction methods in an automated fashion. The common subset 3 is comprised of 260 proteins. The predictions by other servers (FUGUE, 3D-PSSM and BLAST) were made between October 2000 and May 2002, implying that the results of other methods are reasonably close to the performances of the most up-to-date versions of other methods. Furthermore, to make the comparison as objective as possible, for each target, we selected as a top hit a template with the highest z-score whose pdb submission date is at least one year earlier than the target. Our method still predicted 95 targets as significant (CE z-score >3.7) predictions. In comparison, the best-performing server according to the web page, FUGUE, made 51 significant predictions. This test demonstrates that our method performs significantly better than any other method. In the ongoing structure prediction at the EVA site (http://cubic.bioc.columbia.edu/eva), PROSPECT II has been among the best performers in the threading category.
![]() |
Discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
![]() |
Acknowledgements |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() ![]() |
---|
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 33893402.
Bowie,J.U., Luthy,R. and Eisenberg,D. (1991) Science, 253, 164170.[ISI][Medline]
Bryant,S. (1996) Proteins: Struct. Funct. Genet., 26, 172185.[CrossRef][ISI][Medline]
Bryant,S.H. and Altschul,S.F. (1995) Curr. Opin. Struct. Biol., 5, 236244.[CrossRef][ISI][Medline]
Cui,Y. and Wong,W.H. (2000) Phys. Rev. Lett., 85, 52425245.[CrossRef][ISI][Medline]
Domingues,F.S., Lackner,P., Andreeva,A. and Sippl,M.J. (2000) J. Mol. Biol., 297, 10031013.[CrossRef][ISI][Medline]
Eyrich,V.A., Marti-Renom,M.A., Przybylski,D., Madhusudhan,M.S., Fiser,A., Pazos,F., Valencia,A., Sali,A. and Rost,B. (2001) Bioinformatics, 17, 12421243.
Finkelstein,A.V. (1998) Phys. Rev. Lett., 80, 48234825.[CrossRef][ISI]
Fischer,D. (2000) Pac. Symp. Biocomput., 119130.
Fischer,D., Elofsson,A., Bowie,J.U. and Eisenberg,D. (1996) In Hunter,L. and Klein,T. (eds), Biocomputing: Proceedings of the 1996 Pacific Symposium. World Scientific, Singapore, pp. 300318.
Fitch,W.M. and Smith,T.F. (1983) Proc. Natl Acad. Sci. USA, 80, 13821386.[Abstract]
Godzik,A., Skolnick,J. and Kolinski,A. (1992) J. Mol. Biol., 227, 227238.[ISI][Medline]
Gonnet,G.H., Cohen,M.A. and Benner,S.A. (1992) Science, 256, 14431445.[ISI][Medline]
Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Proc. Natl Acad. Sci. USA, 84, 43554358.[Abstract]
Henikoff,S. and Henikoff,J.G. (1997) Protein Sci., 6, 698705.
Holm,L. and Sander,C. (1996) Science, 273, 595602.
Jaroszewski,L., Rychlewski,L. and Godzik,A. (2000) Protein Sci., 9, 14871496.[Abstract]
Jones,D.T. (1999) J. Mol. Biol., 287, 797815.[CrossRef][ISI][Medline]
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 8689.[CrossRef][ISI][Medline]
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 25772637.[ISI][Medline]
Karplus,K., Barrett,C., Cline,M., Diekhans,M., Grante,L. and Hughey,R. (1999) Proteins, Suppl 3, 121125.
Kelley,L.A., MacCallum,R.M. and Sternberg,M.J.E. (2000) J. Mol. Biol., 299, 499520.[ISI][Medline]
Lathrop,R.H. (1994) Protein Eng., 7, 10591068.[Abstract]
Lindahl,E. and Elofsson,A. (2000) J. Mol. Biol., 295, 613625.[CrossRef][ISI][Medline]
Lu,H. and Skolnick,J. (2001) Proteins, 44, 223232.[CrossRef][ISI][Medline]
Melo,F., Sanchez,R. and Sali,A. (2002) Protein Sci., 11, 430448.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536540.[CrossRef][ISI][Medline]
Panchenko,A.R., Marchler-Bauer,A. and Bryant,S.H. (2000) In Quantitative Challenges in the Post-Genome Sequence Era: a Workshop and Symposium. The La Jolla Interfaces in Science, La Jolla, CA, p. 2.
Reva,B.A., Skolnick,J. and Finkelstein,A.V. (1999) Proteins, 35, 353359.[CrossRef][ISI][Medline]
Rost,B. and Sander,C. (1993) J. Mol. Biol., 232, 584599.[CrossRef][ISI][Medline]
Rychlewski,L., Jaroszewski,L., Li,W. and Godzik,A. (2000) Protein Sci., 9, 232241.[Abstract]
Shi,J., Blundell,T.L. and Mizuguchi,K. (2001) J. Mol. Biol., 310, 243257.[CrossRef][ISI][Medline]
Shindyalov,I.N. and Bourne,P.E. (1998) Protein Eng., 11, 739747.[Abstract]
Sjolander,K., Karplus,K., Brown,M., Hughet,R., Krogh,A., Mian,I. and Haussler,D. (1996) Comput. Appl. Biosci., 12, 327345.[Abstract]
Sun,Z., Xia,X., Guo,Q. and Xu,D. (1999) J. Protein Chem., 18, 3946.[CrossRef][ISI][Medline]
Waterman,M.S. (1995) Introduction to Computational Biology. Chapman and Hall, London.
Xu,Y. and Xu,D. (2000) Proteins: Struct. Funct. Genet., 40, 343354.[CrossRef][ISI][Medline]
Xu,Y., Xu,D. and Uberbacher,E.C. (1998) J. Comput. Biol., 5, 597614.[ISI][Medline]
Yona,G. and Levitt,M. (2002). J. Mol. Biol., 315, 12571275.[CrossRef][ISI][Medline]
Received March 20, 2003; revised June 28, 2003; accepted July 8, 2003.