Use of covariance analysis for the prediction of structural domain boundaries from multiple protein sequence alignments

Daniel J. Rigden

Embrapa Genetic Resources and Biotechnology, Cenargen/Embrapa, S.A.I.N. Parque Rural, Final W5, Asa Norte, 70770-900, Brasília, Brazil


    Abstract
 
Current methods for identification of domains within protein sequences require either structural information or the identification of homologous domain sequences in different sequence contexts. Knowledge of structural domain boundaries is important for fold recognition experiments and structural determination by X-ray crystallography or nuclear magnetic resonance spectroscopy using the divide-and-conquer approach. Here, a new and conceptually simple method for the identification of structural domain boundaries in multiple protein sequence alignments is presented. Analysis of covariance at positions within the alignment is first used to predict 3D contacts. By the nature of the domain as an independent folding unit, inter-domain predicted contacts are fewer than intra-domain predicted contacts. By analysing all possible domain boundaries and constructing a smoothed profile of predicted contact density (PCD), true structural domain boundaries are predicted as local profile minima associated with low PCD. A training data set is constructed from 52 non-homologous two-domain protein sequences of known 3D structure and used to determine optimal parameters for the profile analysis. The alignments in the training data set contained 48 ± 17 (mean ± SD) sequences and lengths of 257 ± 121 residues. Of the 47 alignments yielding predictions, 35% of true domain boundaries are predicted to within 15 amino acids by the local profile minimum with the lowest profile value. Including predictions from the second- and third-lowest local minima increases the correct domain boundary coverage to 60%, whereas the lowest five local minima cover 79% of correct domain boundaries. Through further profile analysis, criteria are presented which reliably identify subsets of more accurate predictions. Retrospective analysis of CASP3 targets shows predictions of sufficient accuracy to enable dramatically improved fold recognition results. 
Finally, a prediction is made for geminivirus AL1 protein which is in full agreement with biochemical data, yielding a plausible, novel threading result.

Keywords: correlated mutations/covariance analysis/multiple sequence alignment/structural domains/threading


    Introduction
 
Domains are the fundamental units of proteins, exhibiting not only structural independence, by definition, but also folding autonomy (Rossmann and Argos, 1981). The frequent correspondence of protein domains and gene exons in eukaryotes has facilitated the transfer and duplication of entire protein domains during evolution. This has resulted in many families of modular proteins containing combinations of domains in varying orders and quantities. The fundamental importance of the domain as the basic protein structural unit is acknowledged by its use in hierarchical protein structure classification systems (Murzin et al., 1995; Orengo et al., 1997).

Largely automated sequence comparison protocols are responsible for several databases of aligned protein domains, such as PFAM (Bateman et al., 2000), PRODOM (Park et al., 1998), DOMO (Gracy and Argos, 1998a,b) and SMART (Schultz et al., 2000). The assignment of domain boundaries for entries in these databases sometimes originates in a manually curated 'seed' alignment, as is the case for PFAM [and now also incorporated into PRODOM (Corpet et al., 2000)]. Alternatively, computer analysis is applied based either on the recurrence of similar sequence segments in different proteins at different distances from the N- and C-termini, or on duplicated segments observed in protein sequences (Gracy and Argos, 1998b). Hence, accurate domain boundary assignment requires, ideally, structural information, or otherwise the repeated occurrence of a domain in different contexts. A problem therefore arises for protein families which lack relevant structural information and whose structures comprise several domains. If these domains are only observed in a single order [as, for example, the four domains of eukaryotic pyruvate kinases (Larsen et al., 1994)], or if sequence comparisons fail to reveal their presence elsewhere, then the current protein domain databases will erroneously assign a single domain to the whole protein.

Knowledge of structural domain boundaries is not just of theoretical interest, but also of great practical importance. For example, conformational heterogeneity is known to impede the crystallization of a protein, the first step towards the determination of its structure by X-ray crystallography. Such heterogeneity is often conferred on proteins comprising domains joined by more or less flexible linker regions. Indeed, the literature abounds in proteins crystallized first, or only, as several individual domains (Owen et al., 1995; Chan et al., 1996). Given prior knowledge of structural domain boundaries, molecular biology techniques could readily be used to produce individual domains that might crystallize more readily than the intact protein (Matthews, 1997). Similarly, for the technique of nuclear magnetic resonance (NMR) spectroscopy, structural flexibility hampers, or renders impossible, structure determination. In the case of NMR there is an additional size limit complicating the structural determination of multi-domain proteins. Therefore, it is common for NMR to be applied to domains or pairs of domains, as in the case of fibronectin (Sticht et al., 1998; Bocquier et al., 1999; Potts et al., 1999). In the area of comparative protein modelling, it has been shown that threading techniques are highly sensitive to sequence length and work best when supplied with the sequences of individual domains, with no domain sequence missing and no additional sequence present (Fischer et al., 1999). Limited proteolysis, sometimes coupled with mass spectrometry, offers an experimental route for the determination of domain structure (Cohen, 1996; Bantscheff et al., 1999) but suffers from the obvious disadvantage of requiring purified protein. Purely sequence-based methods of domain structure prediction (Kuroda et al., 2000; Wheelan et al., 2000) have much wider application.

It is well known that the constraints imposed upon side chain size and chemistry by the 3D packing environment can lead to sequence compensation between spatially close residues (Lesk and Chothia, 1980). In other words, the presence of a certain amino acid at position x may sometimes only be accommodated if a particular amino acid is present at position y. Hence, analysis of multiple sequence alignments could be used to make predictions about 3D amino acid contacts. This idea has been extensively investigated with the conclusion that a clear but weak signal is present in multiple sequence alignments (Gobel et al., 1994; Shindyalov et al., 1994; Taylor and Hatrick, 1994). Covariance analysis has since been used for ab initio protein structure prediction (Orengo et al., 1999; Ortiz et al., 1999), for the discrimination of correct and incorrect threading results (Olmea et al., 1999), for the prediction of protein–protein interfaces (Pazos et al., 1997) and for the filtering of putative docking solutions (Pazos et al., 1997). Here, we show that covariance analysis of multiple protein sequence alignments can be used for the prediction of structural domains. Improvements on random estimates of domain boundaries are modest but clear, and it is possible to identify a subset of the most accurate predictions by further analysis. Applications of the method to CASP3 targets and to geminivirus AL1 protein illustrate its usefulness.


    Materials and methods
 
Derivation of a data set of predicted contacts

The data set (Table I) was derived from chains in structures of the Protein Data Bank (PDB; Berman et al., 2000) possessing exactly two sequential structural domains as defined in the CATH (Orengo et al., 1997) database v.1.6. A list of PDB entries sharing at most 25% pairwise sequence identity between any two members (Hobohm et al., 1992; http://swift.embl-heidelberg.de/pdbsel) was used to filter out homologous structures. The sequences of the resulting 52 chains were obtained from the PDB and sent to the Hidden Markov Models server (http://www.cse.ucsc.edu/research/compbio/HMM-apps; Karplus et al., 1998) with the request to return a multiple alignment of homologous sequences. This alignment was edited using Jalview (http://circinus.ebi.ac.uk:6543/jalview) to remove those sequences with more than ~10% of gaps relative to the parent sequence. Identical sequences were also removed. The alignment was then converted to HSSP format using the PredictProtein server (http://dodo.cpmc.columbia.edu/predictprotein; Rost, 1996) and covarying amino acids were calculated using the PREDICT program (Olmea and Valencia, 1997). The improved algorithm, also incorporating information from sequence conservation (Gobel et al., 1994), was selected. The mean pairwise sequence identity within each alignment was calculated with the aid of MODELLER-4 (Sali and Blundell, 1993).


 
Table I. Basic characteristics of the predicted contact data set
 
Using exactly the same steps, two decoy data sets were created from protein structures having a single domain, in order to test the ability of the predicted contact density (PCD) profile method to distinguish between one- and two-domain proteins. One decoy data set contained 10 proteins of approximately 147 residues, this being the mean calculated for non-homologous single-domain proteins in the CATH database. The parent sequences were those of the PDB chains (1SRA0, 2OCCD, 1GPR0, 1HLM0, 1VHH0, 1APYB, 1AAK0, 1JON0, 1DEF0 and 1ATO0) with lengths from 141 to 162 residues. Another data set contained 11 longer single-domain proteins (1AOMA, 1ISO0, 1QCWA, 1SIG0, 2DNJA, 3PCGA, 2BLTA, 1AVPA, 1PLQ0, 1HFC0 and 2BCT0) with lengths of 169–567 residues, mean 337. Other characteristics of the decoy data sets match those of the two-domain data set well: mean pairwise sequence identities within the alignments are 48 and 54% for the decoy data sets (compared with 48% for the two-domain data set), whereas the mean numbers of sequences in the alignments are 64 and 162 (104 for the two-domain data set) and the mean numbers of predicted contacts are 19 and 14 (19).

Predictions were made for the two-domain fold recognition targets included in the CASP3 experiment (Murzin, 1999) using alignments constructed in exactly the same way as previously. These corresponded to the PDB chains 1BKB0, 1B9KA and 1DW9A (CASP3 identifiers T0063, T0071 and T0083, respectively). The target T0044 (PDB code 1qmhA) was excluded since one of its domains is inserted into the other. Such proteins are not suitable for PCD analysis and their sequence analysis in general is difficult (Russell and Ponting, 1998). The family of geminivirus AL1 sequences was also analysed using an alignment built from the parent sequence, AL1 from bean golden mosaic virus (Gilbertson et al., 1991).

Prediction of domain boundaries from covariance data

Predictions were made for each alignment using PCD profiles derived from the covariance analysis. Each occurrence of covariance between two sequence alignment positions reported was taken as a predicted 3D contact between the respective amino acids. For each possible domain boundary location the number of contacts in the corresponding inter-domain contact map region was divided by the corresponding area of the inter-domain contact map region. The smaller this value is for a given possible domain boundary location, the higher the presumed chance of its corresponding to an actual 3D structural domain division. Smoothing was carried out by replacing each profile value with the mean value for all positions within a given size of running window centred on the original position. Local minima were then located in the smoothed profile and the five minima with the lowest profile values recorded. The positions of these minima are referred to as LM1–5. A close hit was recorded if the positions of the local minimum with the lowest profile value and the true structural domain boundary differed by fewer than 15 residues. The number of minima in the profile and the depths of LM1–5, defined as the mean profile value of the two flanking local maxima minus the profile value of the local minimum, were also recorded. All profile values, and hence also the LM1 depth and LM1 depth/number of local minima figures, were routinely multiplied by 1000 for convenience.
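The profile construction described above can be illustrated with a minimal Python sketch. It is not the original YABasic implementation. The function names, the truncation of the smoothing window at the profile ends, the treatment of flat runs as a single extremum and the use of the profile ends as flanking maxima when no interior maximum exists are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Contact = Tuple[int, int]  # predicted contact between 1-based residue numbers


def smooth(values: Sequence[float], window: int) -> List[float]:
    """Running mean centred on each position (window truncated at the ends: an assumption)."""
    half = window // 2
    out = []
    for k in range(len(values)):
        chunk = values[max(0, k - half):k + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def local_extrema(values: Sequence[float], sign: int) -> List[int]:
    """Indices of local minima (sign=+1) or maxima (sign=-1); a flat run counts once."""
    v = [sign * x for x in values]
    found, k = [], 1
    while k < len(v) - 1:
        if v[k] < v[k - 1]:
            end = k
            while end + 1 < len(v) and v[end + 1] == v[k]:
                end += 1  # walk along a plateau
            if end + 1 < len(v) and v[end + 1] > v[k]:
                found.append((k + end) // 2)  # report the plateau midpoint
            k = end + 1
        else:
            k += 1
    return found


def pcd_minima(contacts: Sequence[Contact], length: int,
               min_domain: int = 40, window: int = 9, top: int = 5):
    """Return (ranked minima, number of minima); each minimum is (boundary, value, depth)."""
    positions = list(range(min_domain, length - min_domain + 1))
    raw = []
    for b in positions:
        # A contact is inter-domain if its residues lie on opposite sides of boundary b.
        n_inter = sum(1 for i, j in contacts if min(i, j) <= b < max(i, j))
        raw.append(1000.0 * n_inter / (b * (length - b)))  # multiplied by 1000 as in the text
    profile = smooth(raw, window)
    minima = local_extrema(profile, +1)
    maxima = local_extrema(profile, -1)
    ranked = []
    for k in minima:
        # Flanking maxima; fall back to the profile ends if none exists (assumption).
        left = max((m for m in maxima if m < k), default=0)
        right = min((m for m in maxima if m > k), default=len(profile) - 1)
        depth = (profile[left] + profile[right]) / 2.0 - profile[k]
        ranked.append((positions[k], profile[k], depth))
    ranked.sort(key=lambda r: r[1])  # LM1 has the lowest profile value
    return ranked[:top], len(minima)
```

Calling `pcd_minima(contacts, length)` on a list of predicted contacts returns the minima ordered as LM1–5 together with the total number of minima, so the LM1 depth/number of local minima figure can be derived from the result.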

Various types of data sets were tested, based either on the default PREDICT output of 19 ± 6 (mean ± SD) predictions or on a longer list of predictions including those of lower assigned confidence values (48 ± 19). To these two bases were applied two kinds of cut-off: by assigned confidence value (0.8, 0.7, 0.6, 0.5 and 0.4) or as a percentage of the most confident predictions (85, 70, 55 and 40%). The profiling algorithm was tested either using the confidence values assigned to the predicted contacts to weight the points on the residue contact map, or simply assigning each prediction the same weight.
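The two kinds of cut-off might be applied as in the following Python sketch. The tuple layout, the function names, the inclusive comparison and the rounding up of the retained fraction are assumptions; the PREDICT output format itself is not reproduced here.

```python
import math
from typing import List, Tuple

Prediction = Tuple[int, int, float]  # (residue i, residue j, assigned confidence)


def filter_by_confidence(preds: List[Prediction], cutoff: float) -> List[Prediction]:
    """Keep predictions whose confidence is at least the cut-off (inclusive: an assumption)."""
    return [p for p in preds if p[2] >= cutoff]


def filter_top_percent(preds: List[Prediction], percent: float) -> List[Prediction]:
    """Keep the given percentage of most confident predictions (rounded up: an assumption)."""
    ranked = sorted(preds, key=lambda p: p[2], reverse=True)
    keep = math.ceil(len(ranked) * percent / 100.0)
    return ranked[:keep]
```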

Repeated predictions were made for entire data sets while systematically altering various parameters. The most important of these were the presumed minimum domain size (in the range 20–45 residues) and the size of the smoothing window (5–19 residues). For each set of parameters, summary results tables were produced recording various performance indicators: how many times LM1 was the nearest minimum to the true boundary, and the number of times LM1, LM2 or LM3 were located within 15 residues of the actual domain division. The number of non-predictions obtained, corresponding to profiles lacking local minima, was also recorded, as were the mean distances of LM1 from the true domain boundary, both for all predictions and for the subset for which LM1 was the nearest of the LMs.
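For a single alignment, these indicators could be derived as in the sketch below. The function name and the dictionary keys are hypothetical, and a hit is taken to mean a distance of fewer than 15 residues, following the definition of a close hit given earlier.

```python
from typing import Dict, List, Optional


def indicators(minima: List[int], true_boundary: int,
               tolerance: int = 15) -> Optional[Dict[str, object]]:
    """Per-alignment indicators; `minima` holds boundary positions ordered LM1, LM2, ..."""
    if not minima:
        return None  # a non-prediction: the profile had no local minimum
    dist = [abs(p - true_boundary) for p in minima]
    return {
        "lm1_error": dist[0],                                   # distance of LM1 from the boundary
        "lm1_is_nearest": dist[0] == min(dist),                 # LM1 is the closest of the LMs
        "lm1_close_hit": dist[0] < tolerance,                   # LM1 alone is a close hit
        "lm1_3_close_hit": any(d < tolerance for d in dist[:3]),  # any of LM1-3 is a close hit
    }
```

Averaging `lm1_error` and counting the boolean fields over all alignments gives the summary figures for one parameter set.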

In order to better estimate the significance of the results, 10 randomized data sets were generated by simply replacing the two residues predicted to contact with two randomly chosen from the length of the protein. Using exactly the same methodology, profile analysis was carried out for each, and performance indicators were recorded and averaged over the 10 data sets. In order to determine the improvement over random predictions made with the real data, the performance indicators obtained for real data were divided by the averaged randomized figures. For example, for better than random predictions this factor will be >1 in the case of the number of close hits recorded but <1 in the case of error measurements.
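The randomization and the real/random factor can be sketched in Python as follows. The function names are hypothetical, and it is assumed here that the two randomly chosen residues must be distinct and are drawn uniformly.

```python
import random
from typing import List, Sequence, Tuple

Contact = Tuple[int, int]


def randomize_contacts(contacts: Sequence[Contact], length: int,
                       rng: random.Random) -> List[Contact]:
    """Replace each predicted pair by two distinct residues drawn uniformly from 1..length."""
    randomized = []
    for _ in contacts:
        i, j = rng.sample(range(1, length + 1), 2)
        randomized.append((min(i, j), max(i, j)))
    return randomized


def real_over_random(real_value: float, random_values: Sequence[float]) -> float:
    """Indicator for the real data divided by its mean over the randomized data sets."""
    return real_value / (sum(random_values) / len(random_values))
```

Ten calls to `randomize_contacts` with the same generator give the 10 randomized lists for one alignment; each is then profiled in the same way as the real contact list.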

In order to determine the reasonable expected accuracy limits of the method, simulated sets of predicted contact data were randomly generated. The overall size of the hypothetical sequence and the ratio of the sizes of its two domains were varied. For each combination, 400 random data sets were generated each containing 18–20 simulated predictions, corresponding roughly to the mean number of predictions obtained for real data (Table I). Various degrees of inter-domain region depletion were tested from 0.1 (inter-domain contact density is one tenth that of the intra-domain regions) to 1.0 (inter- and intra-domain regions have the same contact density). Analysis of these data was carried out using parameters later found to be optimum for real data (minimum domain size of 40 residues, smoothing window of nine residues).
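One way to generate such simulated data is rejection sampling, sketched below in Python. This is an assumed procedure rather than the one actually used: residue pairs are drawn uniformly and an inter-domain pair is accepted only with probability equal to the depletion factor, which makes the expected inter-domain contact density that fraction of the intra-domain density. The function name and signature are hypothetical.

```python
import random
from typing import List, Tuple

Contact = Tuple[int, int]


def simulate_contacts(len1: int, len2: int, depletion: float,
                      n_contacts: int, rng: random.Random) -> List[Contact]:
    """Random contacts for a two-domain protein with domains of len1 and len2 residues."""
    length = len1 + len2
    contacts: List[Contact] = []
    while len(contacts) < n_contacts:
        i, j = sorted(rng.sample(range(1, length + 1), 2))
        inter_domain = i <= len1 < j
        # Thin out inter-domain pairs: keep each with probability `depletion`.
        if inter_domain and rng.random() >= depletion:
            continue
        contacts.append((i, j))
    return contacts


# Example: one simulated data set of 18-20 predictions for a 100/200 protein.
rng = random.Random(2002)
example = simulate_contacts(100, 200, 0.1, rng.randint(18, 20), rng)
```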

Threading experiments were carried out using the hybrid methods of Fischer (Fischer, 2000) at the Bioinbgu server (http://www.cs.bgu.ac.il/~bioinbgu), using the 3D-PSSM program (Kelley et al., 2000; http://www.bmm.icnet.uk/~3dpssm) and using Genthreader (Jones, 1999; http://insulin.brunel.ac.uk/psipred). All PCD profile calculations and analyses were carried out using programs written in YABasic (http://www.yabasic.de) on PCs.


    Results and discussion
 
The basis of the prediction of structural domains

The premise of structural domain prediction using covariance analysis is illustrated in Figure 1. Since domains are, by definition, more compact than multi-domain proteins, in a residue contact map constructed from structural data, the inter-domain region(s), shaded in Figure 1, will be more sparsely populated than the intra-domain regions. Given sufficiently accurate predictions from covariance analysis, a predicted contact map should also possess a depleted zone corresponding to the inter-domain region. In the simplest case of proteins consisting of two sequential structural domains (i.e. the polypeptide chain folds first one domain and then the other), it then only remains to search for a single domain boundary. This can readily be carried out by constructing a profile containing the inter-domain PCD resulting from envisioning each possible domain boundary. In favourable cases the true domain boundary would be located near a local minimum of the profile, ideally that with the lowest profile value. (Note that in Figure 1, and throughout this article, profile values are multiplied by 1000 for convenience.) For some applications, such as fold recognition, several putative domain boundaries could be tested experimentally so that location of the true domain boundary within 15 residues of the top three local minima (LM1–3) would still signify a useful result. Using these two criteria of success, the potential of this kind of analysis was measured using simulated data.



 
Fig. 1. Residue contact map (a) and corresponding contact density profile (b) for a hypothetical protein of two domains, lengths 100 and 200 residues. Each data point indicates a residue that is predicted to be in contact with another residue. For example, residue 10 is predicted to be in contact with residue 90. Only the upper left half of the contact map is occupied since the lower residue number for each contact is taken as the x coordinate. The shaded area corresponds to the inter-domain region of the contact map, which is less populated than the two intra-domain regions (unshaded, labelled 1 and 2). The local minimum (LM) of the contact density profile with the lowest profile value (LM1) identifies the domain boundary with an error of four residues. Here and throughout, (predicted) contact density profile values, and therefore their derived properties, are multiplied by 1000 for convenience.

 
Characterization of the problem using simulated contact predictions

In order to examine the feasibility of the approach outlined above, simulated sets of predicted contact data were randomly generated (see Materials and methods). Figure 2a shows the number of cases for each parameter set in which the local minimum with the lowest profile value gave an accurate prediction (within 15 residues of the actual domain boundary) and Figure 2b the number of cases in which none of the three local minima with the lowest profile values corresponded to the domain boundary. The results show that under the most favourable conditions, with maximal depletion of the inter-domain predicted contact zone, accurate identification of domain boundaries by profile analysis is feasible. In the most favourable case (the hypothetical protein with two domains of 200 residues each), LM1 corresponds well to the domain boundary in 368/400 cases whereas the boundary is outside 15 residues from LM1–3 in just 29 cases. For the hardest hypothetical protein (with domains of 100 and 300 residues), LM1 indicates the domain boundary correctly in 290 cases and none of LM1–3 marks the boundary in 56 cases. In all cases, the highly successful predictions decrease in number (Figure 2a) and the failures increase (Figure 2b) as the inter-domain contact density approaches that of the intra-domain regions. It is also clear that predictions for small proteins are more successful than those for large proteins, presumably since there are fewer possible domain boundaries in the former. These results are expected, but the behaviour of different hypothetical proteins differs in surprising ways. For example, from similar success rates at an inter-domain depletion factor of 0.1 (316 and 365 for the 130/270 and 50/150 proteins, respectively), the success rate for the 130/270 domain hypothetical protein drops to just 138 at a depletion factor of 0.4 whereas for the 50/150 protein success is maintained in 342 cases.



 
Fig. 2. Performance of the PCD profile method on simulated data. In (a) the number of cases in which LM1 is found within 15 residues of the true domain boundary is plotted for different hypothetical protein domain combinations. In (b) the number of true boundaries not present in the lowest three local minima is shown for the same data.

 
The overall message from these simulations is that accurate domain boundary identification is certainly possible under favourable conditions (low inter-domain depletion factors) but that success depends on protein size (which would be known in advance) and domain size ratio (which would not), particularly under less favourable conditions.

A data set of known two-domain proteins

In order to apply the contact density profile method for domain boundary prediction to real proteins, a data set of multiple sequence alignments was assembled for families of proteins known to have exactly two sequential domains, and covariance analysis carried out (see Materials and methods). These proteins represent the simplest cases addressable by the PCD method. Failure with these would imply that more complicated domain arrangements, requiring more complicated algorithms, would be intractable to PCD analysis. The HMM methods used to assemble the alignments are particularly suitable for this work since the alignments are of good overall quality (Karplus et al., 1999) and the technique is capable of effectively identifying distant homologues (Park et al., 1998), whose presence is known to improve the quality of the results of this type of analysis (Olmea and Valencia, 1997). Some fundamental characteristics of the data set are shown in Table I. The members of the test data set are highly heterogeneous, with the founder members sharing no more than 25% pairwise sequence identity. The size of the founder members varies widely from 116 to 601 residues with a mean of 257 residues and a SD of 121 residues. Similarly, domain sizes vary from 29 to 354 residues, mean ± SD of 122 ± 71 residues. Both the number of sequences and their mean pairwise sequence identity within each alignment vary widely. Alignments with fewer than five or more than 650 sequences were discarded since in the former case predicted contacts from covariance analysis would be of very poor quality, and in the latter case computational demands would be exorbitant. Of the retained alignments, membership ranges from five to 609 sequences. Sequence variation, expressed as mean pairwise sequence identity between alignment members, ranges from 18.9 to 95.9, mean ± SD of 47.6 ± 17.4. Analysis of these alignments with the PREDICT program for correlated mutation analysis yields between nine and 36 contact predictions, mean ± SD of 19 ± 6.0. The depletion factor of the inter-domain predicted contact region relative to the PCD of the whole contact map was calculated for each contact prediction set using the known domain boundaries. The results varied from 0 to 2.08, mean ± SD of 0.87 ± 0.50, with numbers above 1.0 indicating that the inter-domain predicted contact region is more densely populated than the contact map as a whole. Depletion factors above 1.0 reflect inaccurate contact predictions and could not exist for maps of true contacts. This mean depletion factor indicates only a very modest average depletion, considering the importance of this factor, as shown using simulated data (Figure 2). It is also worth noting that depletion factors of 0 (Table I) were often the result of few contact predictions being made, or the concentration of the predictions in just one domain, rather than the ideal style of distribution shown in Figure 1.
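The depletion factor relative to the whole map can be written out as a short Python sketch. The function name is hypothetical, and the area of the whole contact map is taken to be the number of distinct residue pairs, L(L − 1)/2, which is an assumption about the normalization used.

```python
from typing import Sequence, Tuple

Contact = Tuple[int, int]


def depletion_factor(contacts: Sequence[Contact], length: int, boundary: int) -> float:
    """Inter-domain predicted contact density divided by the density of the whole map."""
    if not contacts:
        raise ValueError("no predicted contacts")
    # Inter-domain contacts have one residue on each side of the known boundary.
    n_inter = sum(1 for i, j in contacts if min(i, j) <= boundary < max(i, j))
    inter_density = n_inter / (boundary * (length - boundary))
    whole_density = len(contacts) / (length * (length - 1) / 2.0)
    return inter_density / whole_density
```

Under this normalization a value of 0 means that no predicted contact crosses the boundary, and values above 1.0 arise when the predictions are concentrated in the inter-domain region.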

Results for two-domain proteins and statistical analysis

For each choice of data set, weighting and calculation type (see Materials and methods), predictions were made using various smoothing window sizes and assumed minimum domain sizes. Adoption of a smoothing window was necessary so that the many small local minima resulting from the sparse predicted contact data were ignored in favour of the larger local minima. The use of an assumed minimum domain size helped reduce the problem of the ragged ends present in most alignments. Since the covariance calculation ignores positions with more than 10% gaps, these ragged ends would lead to a lack of predictions for the terminal regions, and hence erroneous areas of low PCD in the resultant profiles.

Several results were monitored, including the number of times LM1 was the closest local minimum to the actual domain boundary, the mean number of residues between LM1 and the domain boundary, this same mean for the subset of cases when LM1 was the closest LM to the domain boundary, and the number of correct predictions where LM1 lay within 15 residues of the domain boundary. In order to assign statistical significance to these results, they were compared with mean values for corresponding randomized data sets (see Materials and methods).

Analysis showed that inclusion of additional lower confidence predictions lowered prediction accuracy, monitored as above (data not shown); the default PREDICT output performed as well as any cut-off data set. Apparently the additional points on the PCD map, which might be expected to enhance the occasionally sparse data distribution, are not sufficiently accurate to justify inclusion. Other experiments showed that using the assigned confidence values in the default PREDICT output had negligible effect on the accuracy of the predictions (data not shown), perhaps because they generally have similar high values (0.75 ± 0.09 for all default predictions). Therefore, equally weighted default PREDICT results were used exclusively for further analysis.

Table II shows the performance of the PCD method applied to the real default data sets compared with mean values from predictions made, using identical methodology, for 10 randomized contact lists. All values are means taken from 35 analyses using smoothing windows of 7, 9, 11, 13, 15, 17 or 19 residues in combination with assumed minimum domain sizes of 25, 30, 35, 40 or 45 residues. Entries in the table represent real/random factors so that values <1 signify better than random performance in the LM1 distance from boundary columns whereas values >1 imply better than random performance in the remaining columns. Table II clearly shows that the predictions made by PCD profiling are better than random, albeit modestly so. For example, LM1 is the closest LM to the true domain boundary in up to twice as many cases as calculated from randomized data. It is also notable that the number of LM1 close hits reaches a value more than double that expected by chance. Further inspection of the results revealed a heterogeneous mixture of remarkably accurate predictions with others of varying inaccuracy. Therefore, a search was made for a way of identifying the successful predictions.


 
Table II. Performance of predictions made for real data compared with that for randomized data sets
 
Factors correlated with prediction accuracy

A correlation was noted between a low number of profile minima and low distances between LM1 and the true domain boundary (data not shown). For profiles containing a single local minimum, LM1 errors were in the range 2–7 residues. When two local minima were present, the maximum LM1 error was 26 residues, and the trend seems to continue for larger numbers of minima. The depth of each individual local minimum also correlated with its distance from the actual domain boundary. When the error is plotted against local minimum depth for up to five lowest local minima, large depths are associated solely with more accurate predictions (data not shown). Low depth local minima are associated with errors of all sizes. When the depth of each local minimum is combined with the number of minima of the profile from which it came, the tail of the new graph is even more marked; predictions characterized by large depth/number of local minima values are even better associated with low local minimum-domain boundary errors (data not shown). Using this LM1 depth/number of local minima formulation, parameters were re-examined in order to see which combination enabled the identification of the largest number of accurate predictions.
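The combined criterion is simple to state in code. The Python sketch below uses hypothetical function names and assumes that a prediction is retained when its LM1 depth/number of local minima value is at least the chosen cut-off, since large values are the ones associated with accurate predictions.

```python
from typing import List, Tuple


def depth_per_minimum(lm1_depth: float, n_minima: int) -> float:
    """LM1 depth divided by the number of local minima in the same profile."""
    if n_minima < 1:
        raise ValueError("a profile without local minima gives no prediction")
    return lm1_depth / n_minima


def select_reliable(predictions: List[Tuple[str, float, int]], cutoff: float) -> List[str]:
    """Names of alignments whose score reaches the cut-off (inclusive: an assumption).

    Each prediction is (alignment name, LM1 depth, number of local minima).
    """
    return [name for name, depth, n in predictions
            if depth_per_minimum(depth, n) >= cutoff]
```

A deep LM1 in a profile with one or two minima scores highly, whereas the same depth in a ragged profile with many minima is penalized.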

Overall, the most effective parameter set for the identification of the best predictions comprises an assumed minimum domain size of 40 residues and a smoothing window of nine residues. Figure 3 illustrates the effect of varying these ideal parameters (Figure 3a and b), and also a comparison of the LM1 depth/number of local minima criterion with the simpler LM1 depth measure (Figure 3c). In Figure 3, a line is drawn for each different set of parameters, resulting from the application of different LM1 depth/number of local minima cut-offs in the range 0.6–4.0, each leading to a certain number of predictions below the cut-off with a certain mean error. LM1 depth cut-offs in the range 0.15–0.6 were used.



 
Fig. 3. Determination of the best parameters for reliable identification of more accurate predictions. In each case the number of predictions below a cut-off is plotted against their mean prediction error. (a) Effect of assumed minimum domain size; (b) effect of smoothing window size; (c) effect of choice of cut-off type.

 
The optimal parameter set produces an essentially linear relationship between the number of predictions below the cut-off and their mean error. With these parameters and the most stringent cut-off, three predictions can be identified with a mean error of just five residues. For each additional three predictions admitted by relaxing the cut-off, the mean error increases by approximately two residues. The relationship between the value of the cut-off, the number of predictions making the cut-off and their mean error is tabulated in Table III. Using these data, the LM1 depth/number of local minima for new tested alignments can be assigned a probable degree of significance. Table III also shows similar data obtained using a simple LM1 depth cut-off, which, although performing worse overall, was found to be more indicative in some individual cases. This presumably reflects the independence of the number of local minima and LM1 depth indicators of prediction accuracy.


 
Table III. Selection of accurate predictions using LM1 depth/number of local minima and LM1 depth criteria at different cut-off levels
 
Characteristics of predictions made with optimized parameters

A final prediction of domain boundaries for the test data set was made and analysed to determine factors associated with successful and unsuccessful predictions. Experiments with simulated data (Figure 2) suggested that large proteins of similar domain size and weak inter-domain depletion (a depletion factor close to 1.0) should be the most difficult cases. The dependence of the domain boundary prediction on covariance analysis suggests that factors associated with accurate contact predictions, namely sequence diversity and number of sequences in the alignment (Olmea and Valencia, 1997), should also have a positive influence on domain boundary prediction accuracy.

Figure 4 shows the relationship of prediction error to these characteristics. Indeed, domain boundaries in larger proteins are in general predicted less well (Figure 4a). Linear regression gives a weak but significant correlation coefficient of 0.49. However, it is worth noting that accurate predictions are present for some large proteins. Surprisingly, there seems to be no relationship between prediction accuracy and the ratio of domain sizes (data not shown), with the two least accurate predictions made for proteins with domains of unequal sizes. However, the number of predictions for proteins of unequal domain size may be too small to draw reliable conclusions. Only a very weak relationship between actual inter-domain depletion and LM1 prediction error was evident (Figure 4b). However, examination of just those cases where LM1 indicated the domain boundary with an error of less than 15 residues highlights the importance of inter-domain PCD depletion. When all the predictions are ranked by inter-domain PCD depletion and divided into two groups, just four accurate predictions are found among the cases with least depletion whereas 13 accurate predictions are made where inter-domain PCD is more depleted. Surprisingly, only weak correlations were observed between prediction error and sequence variability (mean pairwise percent sequence identity within the alignments; Figure 4c) and between prediction error and the number of sequences in the alignment (Figure 4d). Alignments with fewer than 15 sequences are routinely thought to be inadequate for the prediction of residue contacts (Olmea and Valencia, 1997). To a certain extent our results confirm this trend since, of the eight alignments used that contain fewer than 15 sequences, four led to non-predictions. However, two others, containing seven and 11 sequences, led to predictions with errors of just 13 and 14 residues. Hence, useful information may be derived from some alignments of rather few sequences.
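The two analyses used in this paragraph, a correlation coefficient between a protein characteristic and prediction error, and a count of accurate predictions in each half of a ranked list, can be sketched as below. This is a minimal illustration rather than the analysis code behind Figure 4: the function names are hypothetical, and it is assumed that a larger depletion value means stronger inter-domain PCD depletion.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def accurate_by_half(depletion, errors, max_error=15):
    """Rank predictions by depletion, split the ranking into two halves and
    count predictions with error < max_error in each half.
    Returns (count in less depleted half, count in more depleted half)."""
    ranked = sorted(zip(depletion, errors), key=lambda pair: pair[0])
    half = len(ranked) // 2

    def n_accurate(group):
        return sum(1 for _, err in group if err < max_error)

    return n_accurate(ranked[:half]), n_accurate(ranked[half:])
```

With the data of this study, the second function would correspond to the comparison of four versus 13 accurate predictions reported above.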



Fig. 4. Dependence of LM1 prediction error on (a) sequence size, (b) inter-domain PCD depletion, (c) number of sequences in the alignments and (d) mean pairwise percent sequence identity within the alignments.

 
In summary, the final analysis for 52 proteins yielded five non-predictions and 47 predictions. Non-predictions were most commonly observed for those alignments containing few sequences, with four of the five resulting from alignments with eight or fewer sequences. Of the predictions made, in 35% of cases LM1 was located within 15 residues of the true domain boundary. For 60% of cases one of LM1, LM2 or LM3 was found within 15 residues of the actual domain boundary, whereas the lowest five local minima cover 79% of correct domain boundaries. Application of the LM1 depth/number of local minima criterion enabled the identification of 30% of cases in which the mean domain boundary prediction error was just 12.9 residues.
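The coverage figures quoted above (35, 60 and 79% for the lowest one, three and five local minima) are of the following form. The sketch is illustrative only: the function names and the data layout, with each local minimum given as a (position, profile value) pair, are assumptions, and "within 15 residues" is taken to include an error of exactly 15.

```python
def covered(minima, true_boundary, k, tolerance=15):
    """True if any of the k lowest local minima lies within `tolerance`
    residues of the true domain boundary.
    `minima` is a list of (position, profile_value) pairs."""
    lowest = sorted(minima, key=lambda m: m[1])[:k]
    return any(abs(pos - true_boundary) <= tolerance for pos, _ in lowest)

def coverage(cases, k, tolerance=15):
    """Fraction of cases whose true boundary is covered by the k lowest minima.
    `cases` is a list of (minima, true_boundary) pairs."""
    hits = sum(covered(minima, boundary, k, tolerance) for minima, boundary in cases)
    return hits / len(cases)
```

Calling `coverage(cases, 1)`, `coverage(cases, 3)` and `coverage(cases, 5)` on the same set of profiles gives the three levels of coverage for LM1 alone, LM1 to LM3 and LM1 to LM5.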

Examples of successful and unsuccessful predictions are shown in Figures 5 and 6, respectively. As the case of Pseudomonas 2,3-dihydroxyphenyl 1,2-dioxygenase (1DHY0) shows, a high degree of inter-domain contact depletion is not essential for an accurate prediction. The example of human salivary α-amylase (1SMD0) shows that accurate predictions can be made for the more difficult cases (Figure 2) of larger proteins. Unsuccessful predictions can be divided into two categories: the non-predictions, such as pertussis toxin (1PRTB), and the inaccurate predictions, such as catabolite gene activator protein (2CGPA). In these cases the blame presumably lies with the intrinsically limited accuracy of the contact predictions (Olmea and Valencia, 1997), exacerbated in many of the non-prediction cases by the limited number of sequences available.



Fig. 5. Predicted contact maps for two successful domain boundary predictions, (a) 1DHY0 and (b) 1SMD0, and their associated PCD profiles (c) and (d), respectively. Dotted lines mark predictions and solid lines actual domain boundaries in the profiles. The dotted boxes in the contact maps show their corresponding divisions into inter-domain (upper left) and intra-domain regions based on domain boundary predictions made.

 


Fig. 6. Predicted contact maps for two unsuccessful domain boundary predictions, (a) 1PRTB and (b) 2CGPA, and their associated PCD profiles (c) and (d), respectively. Lines and boxes have the same meaning as in Figure 5.

 
Application of the technique to a decoy data set of one-domain proteins

In order to assess the performance of the PCD profile method on single-domain proteins, two further data sets were generated, the first of one-domain proteins of the most typical length (141–162 residues) and the second of larger one-domain proteins with lengths in the range 169–567 residues (see Materials and methods). The results of the PCD profile analysis immediately showed that, although effective in highlighting more accurate predictions for two-domain proteins, the LM1 depth/number of local minima and LM1 depth characteristics are not capable of discriminating against false predictions made for one-domain proteins; predictions were made in both data sets at levels exceeding the cut-offs shown in Table III.

Comparison of false predictions for one-domain proteins and correct predictions for two-domain proteins revealed one characteristic with some discriminatory capability. For correct two-domain protein predictions above the LM1 depth/number of local minima value of 0.1, most (six out of nine) assignments lead to divisions of the contact map with both intra-domain regions populated. In contrast, only one of the six false predictions for the typical length one-domain data set fulfilled this criterion. For the larger single-domain protein data set, the single prediction above this cut-off did not lead to two populated intra-domain regions of the predicted contact map.

Therefore, it seems that combining the LM1 depth/number of local minima cut-off with the 'both intra-domain regions populated' rule enables many false predictions made for single-domain proteins to be discounted. Nevertheless, the false predictions remain a problem, particularly for smaller proteins. This is an important consideration when structural studies are projected, but is less significant for threading studies, which generally require little time.
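The 'both intra-domain regions populated' rule can be expressed compactly. The following is a hedged sketch, not the program used in this work: the function name is hypothetical, predicted contacts are assumed to be given as pairs of residue numbers, and the boundary is assumed to be the first residue of the C-terminal domain.

```python
def both_intra_regions_populated(contacts, boundary):
    """Return True if the predicted contact map contains at least one contact
    wholly inside the N-terminal domain and at least one wholly inside the
    C-terminal domain, for a boundary placed before residue `boundary`."""
    n_terminal = any(i < boundary and j < boundary for i, j in contacts)
    c_terminal = any(i >= boundary and j >= boundary for i, j in contacts)
    return n_terminal and c_terminal

# Hypothetical predicted contacts for a 200-residue protein.
contacts = [(12, 45), (30, 88), (120, 160), (95, 150)]
print(both_intra_regions_populated(contacts, 100))
```

A prediction that passes the cut-off but fails this test would be treated as a likely false prediction for a single-domain protein.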

Application of the technique to CASP3 targets

Four of the targets of the CASP3 blind structure prediction contest (Moult et al., 1999) consisted of two domains: target IDs T0044, T0063, T0071 and T0083, now corresponding, respectively, to PDB chains 1QMHA, 1BKB0, 1B9KA and 1DW9A. Setting aside 1QMH, in which one domain is inserted into the other, thereby complicating analysis (Russell and Ponting, 1998), PCD profiles were calculated for the remaining three chains to see if predicted domain boundary definitions would have helped fold assignment. For 1BKB0, 1B9KA and 1DW9A, respectively, the alignments contained 71, 15 and 10 sequences, sharing 33–47% mean sequence identity, leading to 10, 13 and 19 predicted contacts.

For 1BKB0 and 1B9KA, predictions were made with characteristics suggesting high reliability. The LM1 depth/number of local minima values for these profiles were 0.49 and 0.68, respectively. Comparison of the predicted and actual domain boundaries revealed these predictions to be correct to within nine and three residues, respectively. In contrast, the LM1 depth/number of local minima value for 1DW9A was just 0.02, not indicative of a reliable result, and indeed the prediction was incorrect. The case of 1DW9A may not have been helped by the fact that the C-terminal domain does not form a compact structure, instead intertwining with corresponding domains in symmetry-related subunits (Walsh et al., 2000).

Encouraged by the two strong predictions, threading experiments were carried out to compare results for the sequences of entire chains, predicted domains and actual domains. The results of 3D-PSSM analysis (Kelley et al., 2000) are summarized in Table IV. Correct results for the N- and C-terminal domains of 1BKB0 were SH3-like folds and OB folds, respectively, whereas for 1B9KA the N- and C-terminal domains resemble immunoglobulin folds and TATA-box-binding protein structural repeats, respectively (Murzin, 1999).


Table IV. Effect of domain boundary information on threading results obtained with 3D-PSSM (Kelley et al., 2000) for two CASP3 targets
 
For all four domains, the use of PCD profile domain boundary predictions dramatically improves threading performance. Using the full sequences, correct matches for two of the domains were not contained in the results list, whereas matches for the other two appeared at positions 7 and 12. In contrast, using predicted domain boundaries, correct structural matches are found in first place in two cases and at second and third positions in the other two cases. These structural matches are correct at the SCOP fold level in two cases, the superfamily level in one case and the family level in the fourth case. Although use of the sequences corresponding to actual domains generally improves the scores and rankings of correct structural matches (Table IV), it is clear that the PCD profile predictions would have been accurate enough to transform the prospects of correct fold recognition in these cases.

Application of the technique to geminivirus AL1 protein

AL1 (also known as Rep), a protein of approximately 260 residues, is the only protein required for replication of all geminiviruses (Elmer et al., 1988) and possesses multiple biochemical activities including DNA binding (Fontes et al., 1992). A series of experiments has culminated in the identification of the AL1 origin DNA-binding site and cleavage domain within residues 1–116 and 1–120, respectively (Gladfelter et al., 1997; Orozco et al., 1997).

An alignment of AL1 protein sequences was constructed from the parent sequence of bean golden mosaic virus (Gilbertson et al., 1991) using the same methods as applied to the test data set. It contained 117 sequences sharing a mean pairwise percentage sequence identity of 66 ± 15. Using the optimized parameter set, a PCD profile was constructed and analysed. LM1 of this profile lay at residue 132 and had depth and depth/number of local minima characteristics of 0.18 and 0.03, respectively. According to Table III, the depth/number of local minima value is not indicative of a reliable prediction, but the depth of LM1 corresponds to an average error of approximately 19 residues. In addition, the domain definition agrees very well with the functionally defined AL1 origin DNA-binding site domain from residues 1–116 (Gladfelter et al., 1997).

The two putative domains of the bean golden mosaic virus AL1 sequence were then subjected to threading experiments. The most significant results were obtained using the methods of Fischer (Fischer, 2000), which suggested a structural correspondence between the first AL1 domain and the C-terminal single-stranded DNA-binding domain of topoisomerase (1YUA; Yu et al., 1995), which has a length of 122 residues. This domain belongs to the same SCOP superfamily (zinc β-ribbon) as the single-stranded DNA-binding domains of DNA primases (1PFT; Pan and Wigley, 2000) and transcriptional elongation factors (1TFI; Qian et al., 1993). Therefore, the threading result matches well with the AL1 domain 1 sequence in terms of length (122 versus 116) and biochemical activity (single-stranded DNA binding).

The dependence of the threading result on sequence length is shown in Figure 7. As the length of the AL1 domain sequence supplied deviates from the topoisomerase domain length, the threading score drops rapidly, particularly in the direction of smaller sequences. When the sequence supplied is 40 residues too short or too long, the topoisomerase domain is no longer the highest scoring fold. These results confirm the sensitivity of threading to sequence length. In this case the PCD profiling method gave a domain size 10 residues larger than that producing the best threading results. However, even with this error the threading score was 96% of the best achievable.



Fig. 7. Dependence of Bioinbgu threading (Fischer, 2000) results on the length of AL1 sequence supplied.

 
Applications of the methodology and future prospects

With the advent of genome sequencing projects, the number of protein sequences in the databases is growing exponentially. With this deluge of sequence data comes the challenge of adequately annotating new sequences. Although new homology-independent methods are arriving (Marcotte, 2000), the bulk of current sequence annotation is based on identification of homology of new sequences with those already characterized. This enables the assignment of characteristics for the new sequence with a degree of accuracy dependent on degree of sequence similarity (Devos and Valencia, 2000). New sequence analysis techniques will also contribute to improved functional annotation (Gallet et al., 2000; Hannenhalli and Russell, 2000). The sensitivity of sequence comparison techniques is continually improving (Altschul et al., 1997; Karplus et al., 1998) but there remain cases where sequence divergence has occurred to such an extent that truly homologous proteins are no longer detectable by sequence comparisons (Rost, 1999). Threading methods help significantly in these cases (Fischer and Eisenberg, 1997) but suffer from their sensitivity to the length of supplied sequence; correct matches to a known domain structure may not be obtained if the length of sequence supplied differs markedly from the size of the domain (Fischer et al., 1999; Table IV and Figure 7). By supplying a list of possible domain boundaries and means to judge their reliability (Table III), the PCD profile methodology outlined in this article should help in these cases.

X-ray crystallographic and NMR studies of isolated protein domains offer a route for structural analysis of proteins whose size, flexibility or other characteristics render whole-protein analysis impossible. Accurate domain boundary knowledge is crucial in these cases to avoid the presence of unstructured tails or domain destabilization through removal of structurally important regions. Although the overall success rate of the PCD profile method (35% of top predictions within 15 residues) is low, a subset of accurate predictions can be identified. The top six, scoring 2.8 or more by the LM1 depth/number of local minima criterion (Table III), have errors from CATH domain definitions of 0–7 residues. Therefore, they correspond to essentially correct assignments, especially recalling the different results generated by different domain assignment programs for the same supplied structure (Holm and Sander, 1994; Siddiqui and Barton, 1995; Swindells, 1995). These predictions could safely have been used as the basis for structural studies (Figure 8).



Fig. 8. The six domain boundary predictions judged most reliable by application of the LM1 depth/number of local minima criterion. N-terminal domains are shown in white and C-terminal domains in grey with dotted lines marking the domain boundaries. The figure was made with Molscript (Kraulis, 1991).

 
While a useful number of accurate domain boundary predictions may be made using the current methodology, there is obviously room for improvement. In the future it should prove possible to improve prediction accuracy using other sources of data. The most obvious of these is knowledge of statistical trends in domain size, which has been shown to be sufficient in itself for useful domain predictions (Wheelan et al., 2000). Other sequence analysis methods for domain boundary identification have also been introduced (Kuroda et al., 2000). Less directly, domain fold class (α, β, α + β, α/β) can be reasonably well predicted simply from amino acid composition (Bu et al., 1999). If the domains of a protein belong to different classes then such techniques should be able to help discriminate between PCD profile solutions, if not to predict boundaries unaided. Another source of information might be secondary structure prediction since domain boundaries seem to fall preferentially between regular secondary structure elements [65% in the predicted contacts data set, compared with a typical coil content of 47% in proteins as a whole (Rost and Sander, 2000)]. The quality of the PCD profile results will, of course, also improve as ongoing studies (Choulier et al., 2000; Larson et al., 2000) result in more accurate contact predictions.


    Notes
 
1 E-mail: daniel@cenargen.embrapa.br


    Acknowledgments
 
I wish to thank those responsible for creating and maintaining the internet servers used in this work. I am also very grateful to Dr Linda A. Fothergill-Gilmore for her reading of early versions of the manuscript and to the anonymous referees for useful suggestions.


    References
Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.

Bantscheff,M., Weiss,V. and Glocker,M.O. (1999) Biochemistry, 38, 11012–11020.

Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L.L. (2000) Nucleic Acids Res., 28, 263–266.

Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235–242.

Bocquier,A.A., Potts,J.R., Pickford,A.R. and Campbell,I.D. (1999) Structure Fold Des., 7, 1451–1460.

Bu,W.S., Feng,Z.P., Zhang,Z. and Zhang,C.T. (1999) Eur. J. Biochem., 266, 1043–1049.

Chan,C.L., Lonetto,M.A. and Gross,C.A. (1996) Structure, 4, 1235–1238.

Choulier,L., Lafont,V., Hugo,N. and Altschuh,D. (2000) Proteins, 41, 475–484.

Cohen,S.L. (1996) Structure, 4, 1013–1016.

Corpet,F., Servant,F., Gouzy,J. and Kahn,D. (2000) Nucleic Acids Res., 28, 267–269.

Devos,D. and Valencia,A. (2000) Proteins, 41, 98–107.

Elmer,J.S., Brand,L., Sunter,G., Gardiner,W.E., Bisaro,B.M. and Rogers,S.G. (1988) Nucleic Acids Res., 16, 7043–7060.

Fischer,D. (2000) Pacific Symp. Biocomputing. Hawaii, pp. 119–130.

Fischer,D. and Eisenberg,D. (1997) Proc. Natl Acad. Sci. USA, 94, 11929–11934.

Fischer,D., Barret,C., Bryson,K., Elofsson,A., Godzik,A., Jones,D., Karplus,K.J., Kelley,K.A., Maccallum,R.M., Pawlowski,K. et al. (1999) Proteins, (Suppl. 3), 209–217.

Fontes,E.P.B., Luckow,V.A. and Hanley-Bowdoin,L. (1992) Plant Cell, 4, 597–608.

Gallet,X., Charloteaux,B., Thomas,A. and Brasseur,R. (2000) J. Mol. Biol., 302, 917–926.

Gilbertson,R.L., Hidayat,S.H., Martinez,R.T., Leong,S.A., Faria,J.C., Morales,F.J. and Maxwell,D.P. (1991) Plant Dis., 75, 336–342.

Gladfelter,H.J., Eagle,P.A., Fontes,E.P.B., Batts,L. and Hanley-Bowdoin,L. (1997) Virology, 239, 186–197.

Gobel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins, 18, 309–317.

Gracy,J. and Argos,P. (1998a) Trends Biochem. Sci., 23, 497–497.

Gracy,J. and Argos,P. (1998b) Bioinformatics, 14, 174–187.

Hannenhalli,S.S. and Russell,R.B. (2000) J. Mol. Biol., 303, 61–76.

Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.

Holm,L. and Sander,C. (1994) Proteins, 19, 256–268.

Jones,D.T. (1999) J. Mol. Biol., 287, 797–815.

Karplus,K., Barrett,C. and Hughey,R. (1998) Bioinformatics, 14, 846–856.

Karplus,K., Barrett,C., Cline,M., Diekhans,M., Grate,L. and Hughey,R. (1999) Proteins, (Suppl. 3), 121–125.

Kelley,L.A., MacCallum,R.M. and Sternberg,M.J.E. (2000) J. Mol. Biol., 299, 501–522.

Kraulis,P.J. (1991) J. Appl. Crystallogr., 24, 946–950.

Kuroda,Y., Tani,K., Matsuo,Y. and Yokoyama,S. (2000) Protein Sci., 9, 2313–2321.

Larsen,T.M., Laughlin,L.T., Holden,H.M., Rayment,I. and Reed,G.H. (1994) Biochemistry, 33, 6301–6309.

Larson,S.M., DiNardo,A.A. and Davidson,A.R. (2000) J. Mol. Biol., 303, 433–446.

Lesk,A.M. and Chothia,C. (1980) J. Mol. Biol., 136, 225–270.

Marcotte,E.M. (2000) Curr. Opin. Struct. Biol., 10, 359–365.

Matthews,B.W. (1997) Methods Enzymol., 276, 3–10.

Moult,J., Hubbard,T., Fidelis,K. and Pedersen,J.T. (1999) Proteins, (Suppl. 3), 2–6.

Murzin,A.G. (1999) Proteins, (Suppl. 3), 88–103.

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.

Olmea,O. and Valencia,A. (1997) Fold. Des., 2, S25–S32.

Olmea,O., Rost,B. and Valencia,A. (1999) J. Mol. Biol., 295, 1221–1239.

Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.

Orengo,C.A., Bray,J.E., Hubbard,T., LoConte,L. and Sillitoe,I. (1999) Proteins, 37, 149–170.

Orozco,B.M., Miller,A.B., Settlage,S.B. and Hanley-Bowdoin,L. (1997) J. Biol. Chem., 272, 9840–9846.

Ortiz,A.R., Kolinski,A., Rotkiewicz,P., Ilkowski,B. and Skolnick,J. (1999) Proteins, 37, 177–185.

Owen,D.J., Papageorgiou,A.C., Garman,E.F., Noble,M.E. and Johnson,L.N. (1995) J. Mol. Biol., 246, 374–381.

Pan,H. and Wigley,D.B. (2000) Structure Fold Des., 8, 231–239.

Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) J. Mol. Biol., 284, 1201–1210.

Pazos,F., Helmer-Citterich,M., Ausiello,G. and Valencia,A. (1997) J. Mol. Biol., 272, 1–13.

Potts,J.R., Bright,J.R., Bolton,D., Pickford,A.R. and Campbell,I.D. (1999) Biochemistry, 38, 8304–8312.

Qian,X., Gozani,S.N., Yoon,H., Jeon,C.J., Agarwal,K. and Weiss,M.A. (1993) Biochemistry, 32, 9944–9959.

Rossmann,M.G. and Argos,P. (1981) Annu. Rev. Biochem., 50, 497–532.

Rost,B. (1996) Methods Enzymol., 266, 525–539.

Rost,B. (1999) Protein Eng., 12, 85–94.

Rost,B. and Sander,C. (2000) 3rd generation prediction of secondary structure. In Webster,D.M. (ed.), Predicting Protein Structure: Methods and Protocols. Humana Press, pp. 71–95.

Russell,R.B. and Ponting,C.P. (1998) Curr. Opin. Struct. Biol., 8, 364–371.

Sali,A. and Blundell,T.L. (1993) J. Mol. Biol., 234, 779–815.

Schultz,J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000) Nucleic Acids Res., 28, 231–234.

Shindyalov,I.N., Kolchanov,N.A. and Sander,C. (1994) Protein Eng., 7, 349–358.

Siddiqui,A.S. and Barton,G.J. (1995) Protein Sci., 4, 872–884.

Sticht,H., Pickford,A.R., Potts,J.R. and Campbell,I.D. (1998) J. Mol. Biol., 276, 177–187.

Swindells,M.B. (1995) Protein Sci., 4, 103–112.

Taylor,W.R. and Hatrick,K. (1994) Protein Eng., 7, 341–348.

Walsh,M.A., Otwinowski,Z., Perrakis,A., Anderson,P.M. and Joachimiak,A. (2000) Structure Fold Des., 8, 505–514.

Wheelan,S.J., Marchler-Bauer,A. and Bryant,S.H. (2000) Bioinformatics, 16, 613–619.

Yu,L., Zhu,C.X., Tse-Dinh,Y.C. and Fesik,S.W. (1995) Biochemistry, 34, 7622–7628.

Received April 25, 2001; revised September 27, 2001; accepted November 1, 2001.