Fold recognition aided by constraints from small angle X-ray scattering data

Wenjun Zheng1,2,3 and Sebastian Doniach1

1Departments of Physics and Applied Physics and Laboratory for Advanced Materials, Stanford University, CA 94305 and 2Laboratory of Computational Biology, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA

3 To whom correspondence should be addressed. E-mail: zhengwj@helix.nih.gov


    Abstract
 
We performed a systematic exploration of the use of structural information derived from small angle X-ray scattering (SAXS) measurements to improve fold recognition. SAXS data provide the Fourier transform of the histogram of atomic pair distances (pair distribution function) for a given protein and hence can serve as a structural constraint on methods used to determine the native conformational fold of the protein. Here we used it to construct a similarity-based fitness score with which to evaluate candidate structures generated by a threading procedure. In order to combine the SAXS scores with the standard energy scores and other 1D profile-based scores used in threading, we made use both of a linear regression method and of a neural network-based technique to obtain optimal combined fitness scores and applied them to the ranking of candidate structures. Our results show that the use of SAXS data with gapless threading significantly improves the performance of fold recognition.

Keywords: fold recognition/linear regression/neural network/small angle X-ray scattering


    Introduction
 
With the explosive increase in DNA and protein sequences resulting from the fast progress of large-scale gene sequencing projects (The Genome International Sequencing Consortium, 2001; Venter et al., 2001), the gap between known protein sequences and known structures is widening dramatically. This has led to the establishment of a number of large-scale structural genomics projects (Burley, 2000) for the high-throughput determination of protein structures under the support of the Protein Structure Initiative (PSI; see Stevens et al., 2001). The initiative is targeted at the determination of structures of a minimal set of proteins which could putatively exhaust the universe of all protein folds. Once this goal is achieved, it is believed that the task of protein structure prediction given an unknown sequence would be reduced to the selection of the correct fold from a complete fold library, where a generalized fold recognition strategy which exploits maximal information (both sequence-based and structure-based) might be expected to provide an ultimate solution to the sequence–structure mapping problem for soluble proteins.

Fold recognition (see review by Marchler-Bauer and Bryant, 1999) has been a reasonably effective method by which to identify a probable fold from a fold library for an unknown target protein sequence which has no sequence homologue with a known structure. The standard procedure is to thread the given sequence on to each candidate fold and evaluate the conformational potential energy, which is expected to be minimal for the correct fold (potential-based threading). Threading may be done either in gapless mode, where all possible gapless alignments of the target sequence with a given candidate fold are examined, or by making use of multiple sequence alignment using gap penalties, to create an optimal alignment (or alignments) for subsequent energy testing (Jones, 1999). Recently, attempts have been made to incorporate more sequence-based structural predictions into the fold recognition protocol (David et al., 2000). As an example, a 1D profile consisting of predicted secondary structural assignments and solvent accessibility is employed to do ‘prediction-based’ threading (Rost et al., 1997). Sequential information derived from multiple sequence alignment is also helpful in improving the performance of fold recognition (Rykunov et al., 2000; Williams et al., 2001).

Besides using sequence-based predictions of structural information to supplement potential-based threading, an alternative approach by which to improve standard threading procedures is to exploit additional structural information derived from experiments such as circular dichroism spectroscopy, which are relatively easy to do in comparison with full-scale structural determination (i.e. based on X-ray crystallography or NMR). In this paper we report on the application of small angle X-ray scattering (SAXS) data as a way to impose physical constraints on threading-based protein structure prediction.

SAXS measures X-ray scattering from a protein in a relatively dilute solution. Thus the measurement of SAXS profiles avoids the need to crystallize the protein. SAXS yields physical information about the internal pair distribution of a molecule in its native state. Svergun et al. (2002) have shown that, given a SAXS profile that extends to 5 Å resolution, it is possible to reconstruct a map giving approximate 3D locations of all the residues in the protein. Hence, despite limitations in resolution resulting from the orientational averaging of the molecules in solution and from practical signal-to-noise ratio limitations resulting from radiation damage effects, we believe this physical information has the potential to reduce false positives which naturally occur in fold identification processes based purely on sequence-based information. Recently, we have for the first time explored the application of SAXS-based physical constraints in improving ab initio protein structure prediction (Zheng and Doniach, 2002) and have obtained encouraging results. The present work was motivated by that preliminary work and was aimed at providing a more comprehensive and in-depth study of this novel method in the context of fold recognition. The following improvements were made compared with the previous work (Zheng and Doniach, 2002): first, instead of an empirical combination of the SAXS-based fitness scores with the other scores, we attempted more systematic optimizations of the combined scores; second, we tested this method on a significantly larger set of proteins (see Materials and methods).

Following our previous study, we used SAXS-derived structural information to compute a fitness score which evaluates the similarity in SAXS profile between that of the candidate fold (derived computationally from the Cα representation of the protein) and of the target protein (measured experimentally or simulated computationally). Because SAXS measurements are made on an intact protein (or protein fragment), gapped sequence alignments would not be expected to lead to a strong SAXS similarity (since extra or missing residues in the candidate structure would distort the SAXS profile). Therefore, in this paper we use this score as a supplementary constraint for fold identification that is based on a gapless version of the standard potential energy-based threading procedure. We use both a linear regression-based method (LR) and a neural network-based method (NN) to find optimized combinations of a set of fitness scores. Use of explicit optimization allows us to quantify the performance of the fold identification procedure. We find that the use of an optimized score which includes SAXS information leads to results which are significantly better than those obtained by using each individual fitness score separately and are also significantly better than results obtained by using an optimized combined score without including the SAXS information.

Besides providing an improved fold identification method, the present approach can also be used directly to identify domains which are structurally similar to the target. This is achieved by combining a fold library for fold recognition and a domain library for structural similarity identification. This approach potentially has the capability of recognizing structural homologues or analogues for proteins which are not related by significant sequence similarity.


    Materials and methods
 
A flow chart is shown in Figure 1 to summarize the procedure with each step discussed in this section.



Fig. 1. Flow chart that shows the algorithm of SAXS-aided fold recognition. Each step is described in detail in the text.

 
Selection of training and test sets of sequences

The protein sequences studied were selected from the list in our previous paper (Zheng and Doniach, 2002) and from the Rosetta test set from Baker's group at the University of Washington (Simons et al., 1999a), after excluding those irregular targets without well-defined secondary structures. These lists cover a variety of fold classes (α, β, α/β) with sequence lengths that vary between 31 and 172. In total we use 11 proteins in our training set and 62 proteins in our test set, which marks a significant extension to the set of sequences studied in our previous work (Zheng and Doniach, 2002).

Generating candidate structures by threading to the Dali domain library

In the Dali Domain Classification (Holm and Sander, 1998), each domain is assigned a Domain Classification number DC_lmnp representing the fold space attractor region (l), globular folding topology (m), functional family (n) and sequence family (p). We used the ‘Dali Domain Definitions’ (v3.01) published by the Structural Genomics Group at EMBL-EBI in October 2000, which contains 3689 domains with different DC_lmnp numbers. Given a target protein, we first exclude all domain entries that share the same DC_lmnp number with it because these sequences bear a ≥25% sequence identity with the target. Then we continuously thread the target sequence on to each domain which has a longer sequence length and discard residues which do not overlap the target. Thus for a domain with length L1 and a target sequence with length L0 (L1 > L0), L1 – L0 + 1 structural candidates are obtained by threading. A continuous (gapless) threading is not expected to give as good a residue-wise alignment as dynamic programming-based gapped threading, but it is much more efficient and is sufficient to detect the globally correct folds for most targets we study.
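The window count L1 – L0 + 1 can be illustrated with a minimal sketch. The function name `gapless_threadings` and the toy coordinates are hypothetical and are not taken from the authors' code; unlike the text, the sketch also accepts a template of equal length (one window).

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def gapless_threadings(template_ca: Sequence[Coord], target_len: int) -> List[List[Coord]]:
    """Return every contiguous window of target_len C-alpha positions.

    Each window is one candidate structure: the target sequence is laid
    on to the template without gaps and the non-overlapping template
    residues are discarded.
    """
    n_windows = len(template_ca) - target_len + 1
    if n_windows < 1:
        return []  # template shorter than the target: nothing to thread
    return [list(template_ca[start:start + target_len]) for start in range(n_windows)]


# Toy template with L1 = 10 residues and a target with L0 = 7 residues.
template = [(float(i), 0.0, 0.0) for i in range(10)]
candidates = gapless_threadings(template, 7)
print(len(candidates))  # L1 - L0 + 1 = 4
```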

Definition of native-like structures

In order to define a measure of the closeness of a candidate structure to the native structure of a target protein, we define a ‘native-like’ structure as lying in one of three classes, depending on the overall quality of the set of all generated candidates:

  (A) A structure with cRMS1 (cRMS of all Cα atoms with respect to the experimental structure, same below) less than 6 Å from the true structure, if such a structure exists.
  (B) A structure with cRMS0.8 (cRMS of 80% of Cα atoms with respect to the experimental structure, same below) less than 5 Å but which fails to satisfy the criterion for (A), if no structure satisfying (A) exists.
  (C) A structure with LGA_Q score >1.9 (LGA is a structural comparison tool capable of detecting partial structural similarity which simple cRMS fails to capture; see the subsection Structural alignment for details) but which fails to satisfy the criteria for both (A) and (B), if no structure satisfying (A) or (B) exists.
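The tiered definition can be sketched as a small decision function over the whole candidate set of one target. The function name and the tuple layout (cRMS1, cRMS0.8, LGA_Q) are assumptions made for illustration; the thresholds are the ones stated in the three criteria.

```python
from typing import Iterable, Optional, Tuple

# One candidate is summarised as (cRMS1 in Å, cRMS0.8 in Å, LGA_Q score).
Candidate = Tuple[float, float, float]


def native_like_class(candidates: Iterable[Candidate]) -> Optional[str]:
    """Return which criterion defines 'native-like' for this candidate set.

    The first criterion met by at least one candidate wins, so a later
    class is only used when no candidate satisfies an earlier one.
    """
    cands = list(candidates)
    if any(crms1 < 6.0 for crms1, _, _ in cands):
        return "A"
    if any(crms08 < 5.0 for _, crms08, _ in cands):
        return "B"
    if any(lga_q > 1.9 for _, _, lga_q in cands):
        return "C"
    return None  # no native-like structure in the set


print(native_like_class([(9.2, 7.5, 1.1), (5.4, 4.0, 3.0)]))  # A
print(native_like_class([(9.2, 4.8, 1.1), (8.1, 6.0, 2.5)]))  # B
print(native_like_class([(9.2, 7.5, 2.3)]))                   # C
```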

Prescreening

Before doing full-scale structural evaluation, we perform a simple prescreening using the 1D profile consisting of secondary structural assignments (H for α-helix, E for β-strand and X for loop) and the three-letter HPN translation of the sequence (H for hydrophobic, P for polar, N for neutral), where the classification of hydrophobicity follows Huang et al. (1995). The secondary structural assignment of both target and candidate fold is obtained by the DSSP program (available at http://www.sander.ebi.ac.uk/dssp/).

The alignment of 1D profile between profile A and profile B is done as follows, where A and B are two sequences of either H/E/X or H/P/N:

Given a residue position i, the score AlignAB(i, i) is 1 (a match) if there exist j ∈ [i – 1, i + 1] and k ∈ [i – 1, i + 1] such that Aj = Bk; otherwise AlignAB(i, i) is 0.

To define FSS and FHPN, we compute the fraction (F) of matches for the whole alignment of 1D profile. We keep structures which satisfy the following criteria: FSS > 0.6 and FHPN > 0.8.
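A minimal sketch of the prescreening filter follows. It assumes that both index ranges in the match rule are [i – 1, i + 1], that windows are simply clipped at the two chain ends, and that the two profiles have equal length (as they do after gapless threading). The function names are hypothetical.

```python
def match_fraction(a: str, b: str) -> float:
    """Fraction of positions i at which some letter of a[i-1..i+1]
    equals some letter of b[i-1..i+1] (windows clipped at the ends)."""
    if len(a) != len(b):
        raise ValueError("profiles must have equal length")
    n = len(a)
    if n == 0:
        return 0.0
    matches = 0
    for i in range(n):
        lo, hi = max(0, i - 1), min(n, i + 2)
        if set(a[lo:hi]) & set(b[lo:hi]):
            matches += 1
    return matches / n


def passes_prescreen(ss_target: str, ss_cand: str, hpn_target: str, hpn_cand: str) -> bool:
    """Keep a candidate only if FSS > 0.6 and FHPN > 0.8."""
    return match_fraction(ss_target, ss_cand) > 0.6 and match_fraction(hpn_target, hpn_cand) > 0.8


print(match_fraction("HHEE", "HHEE"))  # 1.0
print(match_fraction("HHHH", "EEEE"))  # 0.0
print(passes_prescreen("HHHXEE", "HHXXEE", "HPHPNN", "HPPHNN"))  # True
```

The ±1 tolerance makes the filter forgiving of small register shifts, which suits a gapless alignment that is not expected to be exact residue by residue.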

After prescreening, about 10^4–10^5 candidate structures are kept for further evaluation.

Fitness scores evaluation

We use the following fitness scores to evaluate the candidate structures:

1. Combined hydrophobicity and burial score Fhpb. First we define Fhp (HP fitness score; see Huang et al., 1995) based on the hydrophobic-polar (HP) model which counts pairs of contacts between hydrophobic residues. We define two residues to be in contact if the distance between their Cα atoms is <7 Å and they are not sequential neighbors.

Then we define Fburial (burial score; see Huang et al., 1995), which measures the extent to which hydrophobic residues are buried inside the core. It is computed by summing the number of residues within a 10 Å distance cutoff from every hydrophobic residue.

Finally, we combine the above two scores as

(1)
where <Fhp> is the HP fitness score averaged by sequence permutation.

2. Statistical contact energy Fstat. We define the statistical energy as the sum of statistical pairwise contact energies between any two residues in contact, based on the 20 × 20 matrix. The pairwise residue–residue interaction energy is calculated based on the frequencies of tertiary contacts in a given PDB structure database. We use the table given in Dima et al. (2000), which we have found to work better than the table used in our previous paper (Zheng and Doniach, 2002).

3. Radius of gyration FRg. We define FRg as the root mean square distance of all Cα atoms along the backbone from their center of mass. This is a useful fitness score for selecting compact structures. Since Rg can be reliably derived from the SAXS data, it partially overlaps with the SAXS score defined later.

4. SAXS fitness score FSAXS. This is defined in the next subsection.

5. 1D profile alignment score: FSS, FHPN. This was defined in the previous subsection.

We make further use of these parameters to construct a combined fitness score in addition to the use in prescreening.
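The geometric ingredients of the scores listed above can be sketched from Cα coordinates alone. The function names are hypothetical; the sketch assumes that a residue is not counted as its own neighbor in the burial sum, and it does not attempt the combination of Fhp and Fburial into Fhpb, whose formula is not reproduced in this copy.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def hp_contacts(ca: Sequence[Coord], hydrophobic: Sequence[bool], cutoff: float = 7.0) -> int:
    """Count hydrophobic-hydrophobic pairs with C-alpha distance < cutoff,
    skipping sequential neighbors (|i - j| = 1)."""
    n = len(ca)
    return sum(
        1
        for i in range(n)
        for j in range(i + 2, n)
        if hydrophobic[i] and hydrophobic[j] and math.dist(ca[i], ca[j]) < cutoff
    )


def burial(ca: Sequence[Coord], hydrophobic: Sequence[bool], cutoff: float = 10.0) -> int:
    """Sum, over hydrophobic residues, of the number of other residues
    within the cutoff distance."""
    n = len(ca)
    return sum(
        1
        for i in range(n) if hydrophobic[i]
        for j in range(n) if j != i and math.dist(ca[i], ca[j]) < cutoff
    )


def radius_of_gyration(ca: Sequence[Coord]) -> float:
    """Root mean square distance of the C-alpha atoms from their centroid."""
    n = len(ca)
    center = tuple(sum(p[k] for p in ca) / n for k in range(3))
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in ca) / n)


# Toy four-residue "structure" on a 4 Å square.
ca: List[Coord] = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 4.0, 0.0), (0.0, 4.0, 0.0)]
hyd = [True, False, True, True]
print(hp_contacts(ca, hyd), burial(ca, hyd), round(radius_of_gyration(ca), 3))  # 2 9 2.828
```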

SAXS fitness score evaluation

We adopt the score function used by Walther et al. (2000). The profile of scattering intensity associated with a bead model is given as follows using the Debye equation in its pair-distance histogram form:

(2)
where N is the number of beads, s is the scattering vector with s = k/2π, g(ri) is the pair-distance histogram of all singly counted pairwise distances and the number of bins is nbins. To represent the I(s) profile, we discretize s with ds = 0.002 Å^–1 and the maximal s is set to 0.12 Å^–1. Profiles are normalized to yield I(0) = 1. The score function or fitness was computed from

(3)
with

(4)
where r is the cross-correlation coefficient between the two scattering intensity curves (IM and IE are the two SAXS profiles computed for the structural model and obtained experimentally, respectively) and w is the weighting factor, chosen to be 10. The term (si/smax)^m (m = 3) adds more weight to differences in the tail of the profile (at higher s values). A smaller value of F corresponds to a better fit between the experimental and predicted profiles.

Here we simulate IE with an all-atom bead model, whereas IM is computed from a Cα-only model without explicit consideration of side chain coordinates, assuming that the side chain atoms sit at the same coordinates as the Cα atom. This approximation in computing IM may reduce the performance of the SAXS score; however, it also increases the robustness of our approach, which may tolerate some degree of measurement error.
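As an illustration of how a profile of this kind can be computed from a bead model, the sketch below uses the textbook Debye sum with unit form factors, I(s) = N + 2 Σ g(ri) sin(2πs ri)/(2πs ri), which is an assumption here because the equation is not reproduced in this copy. The 1 Å bin width, the use of bin centers for ri and all function names are likewise assumptions; the s grid (ds = 0.002 Å^–1 up to 0.12 Å^–1) and the normalization I(0) = 1 follow the text. The fitness F itself is not sketched; only the cross-correlation coefficient r is shown.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def pair_histogram(coords: Sequence[Coord], bin_width: float = 1.0) -> Dict[int, int]:
    """Histogram of singly counted pairwise distances (bin index -> count)."""
    hist: Dict[int, int] = {}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            b = int(math.dist(coords[i], coords[j]) / bin_width)
            hist[b] = hist.get(b, 0) + 1
    return hist


def debye_profile(coords: Sequence[Coord], ds: float = 0.002, s_max: float = 0.12,
                  bin_width: float = 1.0) -> List[float]:
    """I(s) on a uniform s grid, normalised so that I(0) = 1."""
    n = len(coords)
    hist = pair_histogram(coords, bin_width)
    profile = []
    for k in range(int(round(s_max / ds)) + 1):
        s = k * ds
        total = float(n)  # self terms
        for b, count in hist.items():
            x = 2.0 * math.pi * s * (b + 0.5) * bin_width  # bin-centre distance
            total += 2.0 * count * (math.sin(x) / x if x > 0.0 else 1.0)
        profile.append(total / (n * n))  # I(0) = N + N(N - 1) = N^2
    return profile


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson cross-correlation coefficient between two profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


beads = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
profile = debye_profile(beads)
print(len(profile), profile[0])
```

Working from the histogram rather than from the raw pair list keeps the cost of each I(s) evaluation proportional to the number of bins, which matters when 10^4–10^5 candidates are scored per target.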

Structural alignment

cRMS1 and cRMS0.8. We use the standard coordinate RMSD (cRMS) to do structural comparisons between our predicted backbone and the corresponding native Cα backbone (McLachlan, 1971). This is done by superimposing the two structures on to each other and minimizing the RMS deviation between 100% or 80% of all the residues. We try both the given Cα backbone and its mirror image in the computation of cRMS and keep the minimum value.

LGA. The LGA program was developed by Zemla for structural comparative analysis of two protein structures (Zemla, 2003). We use LGA to search for the largest (not necessarily continuous) set of equivalent residues between a candidate structure and its native structure deviating by no more than DIST = 5 Å. We use the quality score LGA_Q (Zemla, 2003) to assess the structure comparison.

Linear regression

Given a set of N fitness scores Fi (i = 1, 2,..., N), we determine a linearly weighted sum of them (FLR) by fitting a linear regression model of the following form (Simons et al., 1999b):

(5)
where wi are fitting constants independent of targets and w(t) depends on target t.

(6)

We construct a training set of structures: {S(t, j) | 0 ≤ t < T, 0 ≤ j < N} for T targets and N structures per target, then we minimize the following squared error:

(7)
Then wj is obtained by solving the following equation:

(8)
where

(9)
and

(10)
and w(t) is given by

(11)

The A matrix is properly regularized so that it is non-singular and the above linear equation is uniquely solvable.
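One plausible reading of this fitting procedure is ordinary least squares with shared weights wi and a separate offset w(t) for each target, which the sketch below implements. It is an illustration under stated assumptions, not the authors' equations: the quantity being fitted (called `quality` here, for example a cRMS-derived value), the small ridge term used to keep A non-singular and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[List[float], float]  # (fitness scores F_1..F_n, quality value)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def fit(data: Dict[str, List[Row]], ridge: float = 1e-8):
    """Return (weights, per-target offsets) minimising the squared error of
    sum_i w_i F_i + w(t) against the quality values."""
    n = len(next(iter(data.values()))[0][0])
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    means = {}
    for t, rows in data.items():
        k = len(rows)
        f_mean = [sum(r[0][i] for r in rows) / k for i in range(n)]
        q_mean = sum(r[1] for r in rows) / k
        means[t] = (f_mean, q_mean)
        for scores, q in rows:
            # Centring within a target eliminates its offset w(t).
            d = [scores[i] - f_mean[i] for i in range(n)]
            for i in range(n):
                b[i] += d[i] * (q - q_mean)
                for j in range(n):
                    a[i][j] += d[i] * d[j]
    for i in range(n):
        a[i][i] += ridge  # assumed regularisation keeping A non-singular
    w = solve(a, b)
    offsets = {t: qm - sum(wi * fi for wi, fi in zip(w, fm)) for t, (fm, qm) in means.items()}
    return w, offsets


# Synthetic check: quality = 2*F1 - F2 + offset(target).
data = {
    "a": [([0.0, 0.0], 5.0), ([1.0, 0.0], 7.0), ([0.0, 1.0], 4.0), ([1.0, 1.0], 6.0)],
    "b": [([0.0, 0.0], -3.0), ([2.0, 1.0], 0.0), ([1.0, 3.0], -4.0), ([3.0, 2.0], 1.0)],
}
weights, offsets = fit(data)
print([round(w, 4) for w in weights], {t: round(o, 4) for t, o in offsets.items()})
```

Because the offsets absorb target-specific shifts in the score scale, only the ranking of candidates within a target matters for the shared weights.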

Multi-layer feed forward neural network

We use a typical three-layer feed-forward neural network (Figure 2) to do fold recognition: the input layer consists of six neurons corresponding to the six fitness scores to be combined for evaluation. The scores are rescaled by a sigmoid function f(x) = 1/(1 + e^(–x)) to values between 0 and 1 at the input layer. The hidden layer has five neurons, which is sufficient for six input variables, and the output layer has two neurons corresponding to ‘positive’ and ‘negative’, respectively. We then compute the ratio P/N between the two outputs and rank the candidates by this ratio: the higher it is, the more favorable is the candidate.



Fig. 2. A three-layer feed-forward neural network used for fold recognition. There are six input variables: F1 = FRg, F2 = FSAXS, F3 = Fhpb, F4 = Fstat, F5 = FSS, F6 = FHPN. The hidden layer has five neurons and the output layer has two nodes corresponding to ‘positive’ and ‘negative’, respectively.

 
The computation at each neuron is done as follows: first compute the weighted sum of all input values from the upstream layer (each link is associated with a weight), then apply the sigmoid function and output the result to the downstream layer.
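The forward pass through such a 6–5–2 network can be sketched as follows. The sketch assumes the standard logistic sigmoid, omits bias terms because the text does not mention them, and uses placeholder weights; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """One layer: weighted sum of upstream values per neuron, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(scores: Sequence[float],
            w_hidden: Sequence[Sequence[float]],
            w_out: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """Return (positive, negative, positive/negative) for six fitness scores."""
    inputs = [sigmoid(s) for s in scores]   # rescaling at the input layer
    hidden = layer(inputs, w_hidden)        # 5 hidden neurons
    positive, negative = layer(hidden, w_out)
    return positive, negative, positive / negative


# With all-zero weights every neuron outputs 0.5, so the ratio is 1.
zero_hidden = [[0.0] * 6 for _ in range(5)]
zero_out = [[0.0] * 5 for _ in range(2)]
print(forward([0.0] * 6, zero_hidden, zero_out))  # (0.5, 0.5, 1.0)
```

Since both outputs lie strictly between 0 and 1, the ratio is always finite and can be used directly as a ranking key.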

The training is performed using the standard back-propagation algorithm and all link-associated weights are adjusted as a result of the learning process. The training set is composed of 11 proteins from Set (A). For each protein from the training set, 5000 candidate structures are extracted from its set of all candidates as ranked by their cRMS1, which includes all the native-like candidates with cRMS1 < 6 Å. The choice of 5000 results from a tradeoff between computing efficiency and the diversity of training data. The target values for both outputs are functions of cRMS1: Positive output is set to 1 if cRMS1 < 4 Å, 0 if cRMS1 > 6 Å and linearly interpolated in between; the negative output is set to 1 minus the positive output.
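The training targets described above amount to a piecewise-linear ramp in cRMS1. A minimal sketch, with a hypothetical function name:

```python
from typing import Tuple


def nn_targets(crms1: float, lo: float = 4.0, hi: float = 6.0) -> Tuple[float, float]:
    """Target (positive, negative) outputs for a candidate with the given cRMS1 (Å).

    positive is 1 below `lo`, 0 above `hi` and linearly interpolated in
    between; negative is its complement.
    """
    if crms1 <= lo:
        positive = 1.0
    elif crms1 >= hi:
        positive = 0.0
    else:
        positive = (hi - crms1) / (hi - lo)
    return positive, 1.0 - positive


print(nn_targets(3.0))  # (1.0, 0.0)
print(nn_targets(5.0))  # (0.5, 0.5)
print(nn_targets(7.0))  # (0.0, 1.0)
```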

The learning process goes through the training set multiple times until 90% of the training targets have at least one native-like candidate ranking in top 10 by the ratio P/N. This choice of learning termination criteria ensures sufficient training without over-learning.

The validation of performance is done by running the neural network on a test set of 32 proteins from Set (A).


    Results and discussion
 
Overview

The method used in this paper consists of the definition of a number of fitness scores with which to assess an alignment of a target sequence with a set of 10^4–10^5 candidate structures generated by gapless threading against the set of folds in the Dali domain library. An optimized combination of these fitness scores is then developed by use of two optimization methods, linear regression and neural network based, on a training set of 11 target sequences.

Once these optimized combinations of fitness scores have been generated, we apply them to a set of 62 test sequences for which we generate 10^4–10^5 candidates per target sequence. We then assess the performance of the fitness scores taken individually and of their optimized combinations, by computing their average Z-score for the native-like subset of candidate structures (see Materials and methods for the definition of ‘native-like’) and by finding the best Z-score rank of the native-like candidate structures.

As another measure of the effectiveness of the optimized fitness scores, we also determine whether at least one of the structures with the top 10 Z-scores is a structural neighbor of the target structure, as measured by the Dali structure alignment tool.

Generating candidate structures via gapless threading

To generate a set of candidate structures for training and evaluation of our fitness scores, we perform gapless threading of each of the target sequences against the Dali domain library (Holm and Sander, 1998), by a procedure which is described in detail in Materials and methods. The results are collected in Table I. This procedure generates sets of 10^4–10^5 candidates for each sequence. For small proteins with sequence length <80 these candidate sets are found to contain native-like structures of type (A) (cRMS1 < 6 Å; see Materials and methods). For longer sequences (>90) the candidate sets contain structures with partially good structural alignments of type (C) (detectable by the LGA structure alignment tool with LGA_Q > 1.9, meaning significant structural similarity; see Materials and methods).


 
Table I. Summary of the results of generating candidate structures with gapless threading

 
We divide the targets into three sets according to the quality of ‘native-like’ structures found in the sets generated by our threading protocol, for which the criteria of native-like structures are defined by (A), (B) and (C) as given in Materials and methods. Roughly, the (A) set is relatively easy for selecting native-like structures satisfying cRMS1 < 6 Å which possess complete structural similarity to the native conformation, whereas the (C) set is more difficult as its structural similarity to the native is at most partially good with LGA_Q > 1.9. The (B) set is somewhere in between.

We select 11 proteins from the (A) set to serve as a training set for both the linear regression and neural network procedures. The rest of the targets are used as a test set for evaluating the performance of our SAXS-aided fold recognition protocol. Efforts are made to ensure that no protein in the test set is a sequence homologue (with >25% sequence identity) of any protein in the training set.

Z-score evaluation of individual scores

In order to select native-like structures from the set of candidates, we need to define fitness scores (see Materials and methods) that are capable of discriminating them against non-native ones. It is then desirable to combine these scores to optimize the overall performance.

Before exploring the combination of multiple score functions we first study them individually. In total six fitness scores (Fhpb, FRg, FSAXS, Fstat, FSS and FHPN) are used, which are described in detail in Materials and methods. They can be classified into three types: energy based (Fhpb, which essentially evaluates how well the hydrophobic residues are buried inside a compact core, and Fstat, which is a statistical pairwise contact energy derived from a protein structure database), 1D profile based (secondary structure assignment profile score FSS and hydrophobic-polar profile score FHPN) and SAXS based (FRg and FSAXS). The main purpose of this study is to evaluate the SAXS-based scores and their ability, in combination with the other scores, to improve the overall discrimination power of fold recognition.

For a given fitness score F and a given native-like structure s, we can define the following Z-score:

$$Z_F(s) = \frac{F(s) - \langle F \rangle}{\sigma_F} \qquad (12)$$
where <F> is the average of F over the whole set of candidate structures and σF is its standard deviation. We use the Z-scores averaged over the set of native-like structures to evaluate the performance of a given score function: the more negative the average is, the better is the ability of the score to discriminate native-like from non-native structures.
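The three summary statistics used below (average Z-score, optimal Z-score and best Z-score rank of the native-like structures) can be sketched as follows, taking Z as the deviation of F(s) from the mean over all candidates in units of the standard deviation. The use of the population standard deviation and the function names are assumptions.

```python
import statistics
from typing import Dict, List, Sequence


def z_scores(scores: Sequence[float]) -> List[float]:
    """Z-score of every candidate relative to the whole candidate set."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(f - mean) / sd for f in scores]


def summarize(scores: Sequence[float], native_like: Sequence[int]) -> Dict[str, float]:
    """Average Z, best (most negative) Z and best rank of the native-like subset.

    Rank 1 is the candidate with the lowest score, since lower is better.
    """
    z = z_scores(scores)
    order = sorted(range(len(scores)), key=lambda i: z[i])
    rank = {idx: r + 1 for r, idx in enumerate(order)}
    native_z = [z[i] for i in native_like]
    return {
        "avg_z": statistics.fmean(native_z),
        "best_z": min(native_z),
        "best_rank": min(rank[i] for i in native_like),
    }


# Five candidates; candidates 0 and 1 are the native-like ones.
result = summarize([1.0, 2.0, 3.0, 4.0, 5.0], [0, 1])
print(result)
```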

In Table I of the Supplementary data (available at PEDS Online), we list the average and the optimal Z-scores and the best Z-score rank of the native-like structures for each individual fitness score F. One can see that FSAXS (with average Zavg = –0.776) and FRg (with average Zavg = –1.289) do possess a good discrimination power to select native-like structures and that they are comparable to Fhpb (with average Zavg = –0.906) and Fstat (with average Zavg = –0.739). Therefore, SAXS-based scores indeed have the potential to help to improve the selection of native-like structures in combination with the other more standard score functions.

Linear regression: performance evaluation of FLR

To find an optimal linear combination of the individual scores that we have just evaluated, we use the linear regression (LR) method (Simons et al., 1999b), which is a simple and effective way of optimizing linear decision making. The motivation is to minimize the overall square deviation between a linear combination of all scores and a prediction quality function (see Materials and methods).

The coefficients for the optimal linear combination are evaluated for the training set of 11 target proteins by minimizing this function when averaged over those 5000 candidate structures closest in cRMS to each of the targets in the training set as explained in Materials and methods.

To evaluate the significance of SAXS scores in addition to other standard score functions, we run an LR for all the score functions excluding SAXS scores (FSAXS and FRg) and then compare it with the LR results obtained when all score functions are included. Here is a summary of the results.

On average, the addition of the SAXS scores improves the Z-scores of FLR from –2.066 to –2.319. Assuming that FLR follows a Gaussian distribution approximately, then this improvement corresponds to a reduction of the p-value from 0.019 to 0.01 (or roughly by a factor of 2), which is fairly significant.

Out of 11 targets in the training set, 11 (100%) show better FLR performance than any individual score F and 10 (90.9%) show better performance for FLR with SAXS information than without it.

Out of 32 targets in the test set [also from Set (A)], 19 (59.4%) show better FLR performance than any individual F and 24 (75%) show better performance for FLR with SAXS information than without it. Therefore, LR provides a reasonably optimal way of combining multiple fitness scores into one score and manages to get the ‘best of all’ performance in most cases. Furthermore the incorporation of SAXS information improves LR's performance further with high probability (75%). Notably, in most of the cases where FSAXS fails to improve the performance further, FLR has already achieved a good Z-score without SAXS data.
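The p-values quoted for the average Z-scores follow from the one-sided tail of a standard normal distribution. A minimal check, with a hypothetical function name:

```python
import math


def one_sided_p(z: float) -> float:
    """Probability that a standard normal variable is <= z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


for z in (-2.066, -2.319):
    print(z, round(one_sided_p(z), 3))
```

For Z = –2.066 and Z = –2.319 this gives approximately 0.019 and 0.010, in line with the values stated in the text.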

In the light of the significantly better performance of FLR, it is natural to ask how much each individual score contributes to this improvement. To shed some light on this issue, we also show the linear correlation coefficient between each individual score F and FLR which measures the relevance of each F to FLR (Table II). It is evident that FSAXS [average correlation coefficient (c.c.) = 0.367] and FRg (average c.c. = 0.584) correlate better with FLR than the other energy-based scores such as Fhpb (average c.c. = 0.104) and Fstat (average c.c. = 0.100). This suggests that FLR's significant improvement in discrimination of native-like structures is to a substantial extent due to the contribution of SAXS information.


 
Table II. Linear correlation coefficient of Fhpb, Fstat, FSAXS and FRg with FLR

 
We comment that the particularly large contribution of Rg to FLR is largely a consequence of the gapless-threading-based protocol of candidate structures generation, which can easily produce many non-compact structures. We expect Rg to be less discriminating if applied to a set of more compact structures. Meanwhile, the weak contribution of Fhpb and Fstat is probably due to the prescreening which requires a significant matching of the HPN profile between the target sequence and the template sequence.

Neural network: performance evaluation of FNN

Neural networks (NNs) have found extensive application in bioinformatics for their well-known capability of learning complicated patterns of relationships among multiple variables characteristic of biological knowledge of gene sequences and structures. There has been some application of NNs in fold recognition (Jones, 1999; Ding and Dubchak, 2001). Here we use a typical three-layer feed-forward NN to explore an optimal exploitation of the same six fitness scores used in LR (including SAXS scores). In comparison with LR, which is a typical linear decision procedure, non-linearity is introduced in NNs with the use of the sigmoid function (see Materials and methods); therefore the NN is not limited simply to producing a weighted linear combination of the original variables and is thus potentially more flexible in capturing complex patterns. The NN in use has six input variables corresponding to six scores: Fhpb, FRg, FSAXS, Fstat, FSS and FHPN; each is normalized by subtracting the statistical average and then dividing by the standard deviation. There are two outputs, one corresponding to ‘positive’ and the other to ‘negative’. To make comparisons with LR's combined score function FLR, we introduce a new score function FNN, which is the ratio between the ‘positive’ output and the ‘negative’ one, and rank structure candidates by this ratio. Similarly to the evaluation procedure used for FLR, we run the NN training and test with and without SAXS scores for comparison. In Table I of the Supplementary data, we list the Z-scores of FNN. The training set for NN is the same as that used for LR. Here is a summary of the results.

On average, the addition of the SAXS scores improves the Z-scores of FNN from –1.550 to –2.033. Again assuming that FNN follows a Gaussian distribution approximately, then this improvement corresponds to a reduction of the p-value from 0.0606 to 0.0212 (or roughly by a factor of 3), which is fairly significant.

Out of 11 targets in the training set, 11 (100%) show better FNN performance than any individual score F and 10 (90.9%) show better performance with SAXS information than without it.

Out of 32 targets in the test set [also from Set (A)], 21 (65.6%) show better FNN performance than any individual F and 24 (75%) show better performance with SAXS information than without it. Therefore, NN shows a similar improvement to that found for LR and again SAXS is shown to be valuable in helping to improve the performance of the NN.

Testing FLR and FNN in native-like structure selection

After obtaining the optimal compilation of our fitness scores, we tested their performance in discriminating native-like structures from the candidate sets generated by our threading protocol. We list the best Z-score rank of native-like structures in Table III.


 
Table III. Performance evaluation of FLR and FNN in selecting native-like structures and correct structural neighbors (SNs)

 
The results show that we have achieved reasonable success in selecting native-like structures (cRMS1 < 6 Å): in 8 (8) out of all 11 targets from the training set, at least one native-like structure is ranked in the top 10 by FLR (FNN). In 15 (14) out of all 32 targets from the test set [the rest of set (A)], at least one native-like structure is ranked in the top 10 by FLR (FNN). For the test set, this corresponds to a success rate of between 40 and 50% for this protocol. We believe there is still ample room for improvement by using more accurate models that include side chains and other backbone atoms.

Testing FLR and FNN in structural neighbor identification

As an alternative test of the effectiveness of the performance of FLR and FNN, we measured which of the candidate structures in the top 10 of the Z-score ranked structures is also a structural neighbor (SN) of the actual protein as measured by the Dali structure alignment tool (alignment Z-score >2). This is a more challenging task than finding structures with low cRMS because the SNs are more remotely related to the target structure and the simple cRMS1 does not detect the partial structural similarities that are detected by the Dali structural alignment. Since our scores are based mostly on the structure as a whole and are sensitive to possible fragmentation of the structure, their ability to discriminate native-like partial structural features is expected to be weaker.

In spite of this, the results in Table III show that we have achieved moderate success in identifying correct SNs among the top 10 Z-score candidates: in seven (six) out of all 11 targets from the training set, at least one candidate from a correct SN is ranked in the top 10 by FLR (FNN). In 11 (11) out of the 16 targets from the test set (the rest of set A) for which correct SNs exist in the set of all candidates, at least one native-like structure is ranked in the top 10 by FLR (FNN). In 10 (11) out of all 26 targets from the harder test set [Sets (B) and (C)], at least one native-like structure is ranked in the top 10 by FLR (FNN). This suggests a success rate of SN identification of between 60 and 70% for relatively easy targets, whereas for harder targets it drops to ~40%, which is still reasonably good.

We also give the p-values for the successful cases in Table III to assess the statistical significance of selecting an SN in the top 10. For some of the target proteins, the p-value is relatively high because of the large number of SNs for those proteins; for most others, the p-value is fairly low and suggests high statistical significance.
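To illustrate why a large number of SNs raises the p-value, the sketch below computes the chance probability that a random selection of 10 candidates contains at least one SN. This is only an assumed null model (random draws without replacement, i.e. a hypergeometric tail); the exact procedure behind the p-values in Table III is not restated in this section, and the function name and example numbers are hypothetical.

```python
import math


def p_at_least_one_sn(n_candidates, n_sn, top_n=10):
    """Chance probability that a random top-n selection contains >= 1 SN candidate."""
    if not 0 <= n_sn <= n_candidates:
        raise ValueError("n_sn must lie between 0 and n_candidates")
    top_n = min(top_n, n_candidates)
    # Probability that all top_n picks come from the non-SN candidates.
    p_none = math.comb(n_candidates - n_sn, top_n) / math.comb(n_candidates, top_n)
    return 1.0 - p_none


# Made-up example: 1000 candidates, with few versus many SN candidates among them.
print("few SNs :", p_at_least_one_sn(1000, 5))
print("many SNs:", p_at_least_one_sn(1000, 200))
```

Under this null model the probability grows monotonically with the number of SN candidates, so a top-10 hit is less surprising for targets that have many SNs.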

Compared with the previous test on native-like structure selection, this test is more relevant in the context of functional genomics based on structural homology relations. As is well known, a specific biological function of proteins is in general executed by a limited number of specific structural features (such as an enzyme's binding site) which are only part of the native structure as a whole. Therefore, the conservation of such partial structural features rather than the whole structure is more relevant to the conservation of function. In this context the present SN selection protocol seems to be fairly promising.

Applications of structural neighbor identification

The identification of correct SNs can provide clues to the functional study of a target protein. To illustrate this, we now discuss several such examples for targets we have studied for which correct SNs are selected and where we see interesting functional connections:

  1. In a number of cases, the selected SN is in precisely the same family by sequence homology, for example:
    1. 1r69 (a 434 repressor) and its SN 1b0nA (sinr protein) both contain the helix–turn–helix motif and fulfil a DNA-binding function;
    2. 1shg (α-spectrin) and its SN 1griA (growth factor-bound protein) both belong to the SH3 domain family and are involved in signal transduction;
    3. 1svq (severin) and its SN 1d0znA (horse plasma gelsolin) both belong to the gelsolin family and are involved in actin binding.

    In all of the above cases, the sequence identity is around 20–30% and so falls into the ‘twilight zone’ where sequence alignment does not give clear results.

  2. In several cases, the selected SN is functionally related to the target:
    1. 1csp (cold shock protein) is involved in DNA binding, whereas its SN 1ah9 (initiation factor) has an RNA-binding property; this suggests that they may both be derived from an ancient nucleic acid-binding protein;
    2. 1leb (lexa repressor DNA-binding domain) is involved in DNA binding in the regulation of DNA repair and transcription, whereas its SN 1ecl (Escherichia coli topoisomerase) participates in DNA topological change and DNA unwinding, so they both share the function of DNA binding;
    3. 1pou (pou-specific domain) binds to specific DNA sequences to bring about temporal and spatial regulation of gene expression, whereas its SN 1knyA (kanamycin nucleotidyltransferase) binds to some RNA primer and has significant homology to family X of polymerases, so they both share the function of DNA binding.

  3. In a few cases, there is no obvious functional relation between the target and the selected SN but there may exist some undiscovered evolutionary relationship suggesting that it could be worth more effort to clarify such relationships:
    1. 1ctf (ribosomal protein) is involved in protein biosynthesis and its SN 1mla (an acyl carrier protein transacylase) functions as a multifunctional enzyme which participates in fatty acid biosynthesis, but it is not clear how they are related to each other;
    2. 1sro (pnpase fragment) is involved in RNA binding whereas its SN 1a62 (ATPase) is involved in ATP binding; it is noted that 1a62 contains a nucleotide-binding site for ATP and ADP which may be the common sub-structure for both of them;
    3. 2ncm (neural cell adhesion molecule fragment) belongs to the immunoglobulin superfamily and may be involved in protein–protein and protein–ligand interactions, whereas its SN 1aac (amicyanin) is involved in copper binding and electron transport; this suggests the possibility that 2ncm binds metal ions.

In summary, the above examples demonstrate that conservation in protein structures may imply evolutionary relationships and that structurally similar proteins may possibly share similar or related functions. Therefore, by identifying SNs which are structurally similar to a given target, we may gain some insight regarding the biochemical function of the target. Work in this direction is expected to be very fruitful.

Conclusion

We have carried out a systematic study of the use of structural information derived from SAXS measurements to improve fold recognition. The SAXS data for a target protein can serve as a structural fingerprint of its native conformation and can therefore be used to construct a similarity-based fitness score to evaluate candidate structures generated by threading. To combine the SAXS scores with the standard energy scores and other 1D profile-based scores, we have used both a linear regression method and a neural network approach from which we obtain optimal combined fitness scores and apply them to the ranking of candidate structures. Our results show that the use of SAXS scores combined with gapless threading significantly improves the performance of fold recognition. We also demonstrate the effectiveness of this protocol in selecting structural neighbors of target proteins, which can potentially aid the study of their biochemical functions.

The above results support the idea that the SAXS-based fitness scores contain structural information that is not captured by the energy-based scores: the energy scores take into account only spatially ‘short-range’ native contacts (with inter-residue distance <7 Å), whereas the SAXS profile contains distance distribution information up to the size of the protein (although residue identities are not resolved). Indeed, at the angle cutoff of Smax = 0.12 Å⁻¹, the SAXS measurement is able to resolve the shape information (but not the detailed secondary structures). Hence, besides the compactness information from Rg, the additional filtering capacity of FSAXS is mostly due to the shape information encoded in the SAXS data, and the performance of FSAXS for a given target protein may depend on the uniqueness of its shape.
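The link between a scattering profile and the distribution of pair distances can be made concrete with the standard Debye formula, I(q) = Σᵢ Σⱼ sin(q rᵢⱼ)/(q rᵢⱼ), evaluated here for point scatterers with unit form factors. This is a minimal sketch, not the scoring procedure used in this work: the coordinates are made up, one scatterer per residue is assumed, hydration is ignored, and q denotes the momentum transfer (if the scattering variable is defined as S = 2 sin θ/λ, then q = 2πS; the convention should be checked against Materials and methods).

```python
import math


def pair_distances(coords):
    """All distinct inter-point distances; their histogram is the pair distribution."""
    return [math.dist(a, b) for i, a in enumerate(coords) for b in coords[i + 1:]]


def debye_intensity(coords, q):
    """Debye formula for identical point scatterers with unit form factors."""
    total = float(len(coords))  # self terms: sin(x)/x -> 1 as x -> 0
    for r in pair_distances(coords):
        x = q * r
        # Each distinct pair (i, j) appears twice in the double sum.
        total += 2.0 * (math.sin(x) / x if x > 1e-12 else 1.0)
    return total


# Made-up coordinates (in Å) standing in for a one-point-per-residue model.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
for q in (0.0, 0.25, 0.5, 0.75):
    print(q, debye_intensity(model, q))
```

Because the intensity depends on the coordinates only through the set of pair distances, two models with the same distance histogram give the same profile, which is why residue identities are not resolved.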

To improve SAXS-aided fold recognition further, it is desirable to replace gapless threading with more sophisticated gapped threading algorithms that take input from multiple sequence alignments (e.g. by PSI-BLAST; see Altschul et al., 1997). This will significantly enrich the native-like structures in the generated set of candidate structures compared with those obtained by gapless threading. We note that the threading-derived sequence–structure alignments must then be used to build a set of complete structural models before the SAXS scores can be assessed. This is not a straightforward task and may need ab initio modeling for those parts of the target protein for which no significant alignment with known structures is found.

In addition to the obvious application of this approach in the post-structural genomics age to help in the identification of the structures of specific genome sequences, it also has potential applications in the implementation of structural genomics projects. Given a set of proteins which have been shown by sequence alignment search to lack sequence homology to proteins of known structure, the use of SAXS data as an input, together with a fold recognition protocol, may be applied to identify a significant number of targets with structural similarity to known proteins even though they lack sequence homology. This approach will then help in target prioritization, either by confirming the putative structural homologues or analogues identified by the SAXS-based threading procedure or by suggesting target sequences with hitherto unknown folds. The SAXS-based technique may therefore help in reducing bottlenecks in high-throughput genomics projects by focusing attention on targets of specific biological or structural interest.

For future work, we plan to improve the SAXS-based protocol by using more accurate models which include side chains and other backbone atoms, in combination with experimentally obtained SAXS data, which may be complicated by measurement errors and the effects of hydration.


    Acknowledgements
 
We thank D.Walther for his seminal contributions to the use of the SAXS fitness score. We are grateful to D.Hinds for providing valuable information about the simulation software that he had developed, to A.Zemla for providing the LGA software and to David Baker's group at the University of Washington for providing the Rosetta decoy set. This work is supported by NSF-PHY98. A hardware gift from INTEL is gratefully acknowledged.


    References
 
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.

Bernstein,F.C., Koetzle,T.F., Williams,G.J.B., Meyer,E.F.,Jr, Brice,M.D., Rogers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) J. Mol. Biol., 112, 535–542.

Burley,S.K. (2000) Nat. Struct. Biol., 7, Suppl., 932–934.

David,R., Korenberg,M.J. and Hunter,I.W. (2000) Pharmacogenomics, 1, 445–455.

Dima,R., Settanni,G., Micheletti,C., Banavar,J. and Maritan,A. (2000) J. Chem. Phys., 112, 9151–9166.

Ding,C.H. and Dubchak,I. (2001) Bioinformatics, 17, 349–358.

Holm,L. and Sander,C. (1998) Proteins, 33, 88–96.

Huang,E.S., Subbiah,S. and Levitt,M. (1995) J. Mol. Biol., 252, 709–720.

Jones,D.T. (1999) J. Mol. Biol., 287, 797–815.

Marchler-Bauer,A. and Bryant,S.H. (1999) Proteins, 37, 218–225.

McLachlan,A.D. (1971) J. Mol. Biol., 61, 409–424.

Rost,B., Schneider,R. and Sander,C. (1997) J. Mol. Biol., 270, 471–480.

Rykunov,D.S., Lobanov,M.Y. and Finkelstein,A.V. (2000) Proteins, 40, 494–501.

Simons,K.T., Bonneau,R., Ruczinski,I. and Baker,D. (1999a) Proteins, 3, 171–176.

Simons,K.T., Ruczinski,I., Kooperberg,C., Fox,B.A., Bystroff,C. and Baker,D. (1999b) Proteins, 34, 82–95.

Stevens,R.C., Yokoyama,S. and Wilson,I.A. (2001) Science, 294, 89–92.

Svergun,D.I., Petoukhov,M.V. and Koch,M.H. (2001) Biophys. J., 80, 2946–2953.

The Genome International Sequencing Consortium (2001) Nature, 409, 860–921.

Venter,J.C. et al. (2001) Science, 291, 1304–1351.

Walther,D., Cohen,F.E. and Doniach,S. (2000) J. Appl. Crystallogr., 33, 350–363.

Williams,M.G. et al. (2001) Proteins, 45, Suppl. 5, 92–97.

Zemla,A. (2003) Nucleic Acids Res., 31, 3370–3374.

Zheng,W.J. and Doniach,S. (2002) J. Mol. Biol., 316, 173–187.

Received November 10, 2004; revised March 7, 2005; accepted March 25, 2005.

Edited by Fred Cohen




