Embrapa Genetic Resources and Biotechnology, Cenargen/Embrapa, S.A.I.N. Parque Rural, Final W5, Asa Norte, 70770-900, Brasília, Brazil
Keywords: correlated mutations/covariance analysis/multiple sequence alignment/structural domains/threading
Introduction
Largely automated sequence comparison protocols are responsible for several databases of aligned protein domains such as PFAM (Bateman et al., 2000), PROMOD (Park et al., 1998), DOMO (Gracy and Argos, 1998a,b) and SMART (Schultz et al., 2000). The assignment of domain boundaries for entries in these databases sometimes originates in a manually curated 'seed' alignment, as is the case for PFAM [and now also incorporated into PRODOM (Corpet et al., 2000)]. Alternatively, computer analysis is applied based either on the recurrence of similar sequence segments in different proteins at different distances from the N- and C-termini, or on duplicated segments observed in protein sequences (Gracy and Argos, 1998b). Hence, accurate domain boundary assignment requires, ideally, structural information, or otherwise the repeated occurrence of a domain in different contexts. A problem therefore arises for protein families which lack relevant structural information and whose structures comprise several domains. If these domains are only observed in a single order [as, for example, the four domains of eukaryotic pyruvate kinases (Larsen et al., 1994)], or if sequence comparisons fail to reveal their presence elsewhere, then the current protein domain databases will erroneously assign a single domain to the whole protein.
Knowledge of structural domain boundaries is not just of theoretical interest, but also of great practical importance. For example, conformational heterogeneity is known to impede the crystallization of a protein, the first step towards the determination of its structure by X-ray crystallography. Such heterogeneity is often conferred on proteins comprising domains joined by more or less flexible linker regions. Indeed the literature abounds in proteins crystallized first, or only, as several individual domains (Owen et al., 1995; Chan et al., 1996). Given prior knowledge of structural domain boundaries, molecular biology techniques could readily be used to produce individual domains that might crystallize more readily than the intact protein (Matthews, 1997). Similarly, for the technique of nuclear magnetic resonance (NMR) spectroscopy, structural flexibility hampers, or renders impossible, structure determination. In the case of NMR there is an additional size limit complicating the structural determination of multi-domain proteins. Therefore, it is common for NMR to be applied to domains or pairs of domains, as in the case of fibrin (Sticht et al., 1998; Bocquier et al., 1999; Potts et al., 1999). In the area of comparative protein modelling, it has been shown that threading techniques are highly sensitive to sequence length and work best when supplied with sequences of individual domains, with no domain sequence missing and no additional sequence present (Fischer et al., 1999). Limited proteolysis, sometimes coupled with mass spectrometry, offers an experimental route for the determination of domain structure (Cohen, 1996; Bantscheff et al., 1999) but suffers from the obvious disadvantage of requiring purified protein. Purely sequence-based methods of domain structure prediction (Kuroda et al., 2000; Wheelan et al., 2000) have much wider application.
It is well known that the constraints imposed upon side chain size and chemistry by the 3D packing environment can lead to sequence compensation between spatially close residues (Lesk and Chothia, 1980). In other words, the presence of a certain amino acid at position x may sometimes only be accommodated if a particular amino acid is present at position y. Hence, analysis of multiple sequence alignments could be used to make predictions about 3D amino acid contacts. This idea has been extensively investigated with the conclusion that a clear but weak signal is present in multiple sequence alignments (Gobel et al., 1994; Shindyalov et al., 1994; Taylor and Hatrick, 1994). Covariance analysis has since been used for ab initio protein structure prediction (Orengo et al., 1999; Ortiz et al., 1999), for the discrimination of correct and incorrect threading results (Olmea et al., 1999), for the prediction of protein–protein interfaces (Pazos et al., 1997) and for the filtering of putative docking solutions (Pazos et al., 1997). Here, we show that covariance analysis of multiple protein sequence alignments can be used for the prediction of structural domains. Improvements on random estimates of domain boundaries are modest but clear and it is possible to identify a subset of the most accurate predictions by further analysis. Applications of the method to CASP3 targets and to geminivirus AL1 protein illustrate its usefulness.
Materials and methods
The data set (Table I) was derived from chains in structures of the Protein Data Bank (PDB; Berman et al., 2000) possessing exactly two sequential structural domains as defined in the CATH (Orengo et al., 1997) database v.1.6. A list of PDB entries sharing at most 25% pairwise sequence identity between any two members (Hobohm et al., 1992; http://swift.embl-heidelberg.de/pdbsel) was used to filter out homologous structures. The sequences of the resulting 52 chains were obtained from the PDB and sent to the Hidden Markov Models server (http://www.cse.ucsc.edu/research/compbio/HMM-apps; Karplus et al., 1998) with the request to return a multiple alignment of homologous sequences. This alignment was edited using Jalview (http://circinus.ebi.ac.uk:6543/jalview) to remove those sequences with more than ~10% of gaps relative to the parent sequence. Identical sequences were also removed. The alignment was then converted to HSSP format using the PredictProtein server (http://dodo.cpmc.columbia.edu/predictprotein; Rost, 1996) and covarying amino acids were calculated using the PREDICT program (Olmea and Valencia, 1997). The improved algorithm, also incorporating information from sequence conservation (Gobel et al., 1994), was selected. The mean pairwise sequence identity within each alignment was calculated with the aid of MODELLER-4 (Sali and Blundell, 1993).
Predictions were made for the two-domain fold recognition targets included in the CASP3 experiment (Murzin, 1999) using alignments constructed in exactly the same way as for the test data set. These corresponded to the PDB chains 1BKB0, 1B9KA and 1DW9A (CASP3 identities T0063, T0071 and T0083, respectively). The target T0044 (PDB chain 1QMHA) was excluded since one of its domains is inserted into the other. Such proteins are not suitable for PCD (predicted contact density) analysis and their sequence analysis in general is difficult (Russell and Ponting, 1998). The family of geminivirus AL1 sequences was also analysed using an alignment built from the parent sequence, AL1 from bean golden mosaic virus (Gilbertson et al., 1991).
Prediction of domain boundaries from covariance data
Predictions were made for each alignment using PCD profiles derived from the covariance analysis. Each occurrence of covariance between two sequence alignment positions reported was taken as a predicted 3D contact between the respective amino acids. For each possible domain boundary location the number of contacts in the corresponding inter-domain region of the contact map was divided by the area of that region. The smaller this value is for a given possible domain boundary location, the higher the presumed chance of its corresponding to an actual 3D structural domain division. Smoothing was carried out by replacing each profile value with the mean value for all positions within a running window of given size centred on the original position. Local minima were then located in the smoothed profile and the five minima with the lowest profile values recorded. The positions of these minima are referred to as LM1–5. A close hit was recorded if the positions of the local minimum with the lowest profile value (LM1) and the true structural domain boundary differed by fewer than 15 residues. The number of minima in the profile and the depths of LM1–5, defined as the mean profile value of the two flanking local maxima minus the profile value of the local minimum, were also recorded. All profile values, and hence also the LM1 depth and LM1 depth/number of local minima figures, were routinely multiplied by 1000 for convenience.
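The profiling steps described above can be sketched in Python as follows. This is a minimal illustration, not the YABasic programs used for the work. All function names are hypothetical, and three details are assumptions not stated in the text: the smoothing window is simply truncated at the ends of the profile, a flat run of equal values counts as a single extremum at its centre, and a profile end point stands in for a flanking maximum when none exists on that side.

```python
def pcd_profile(contacts, length, min_domain=40):
    """Inter-domain predicted contact density (x1000) per candidate boundary.

    A boundary b places residues 1..b in domain 1 and b+1..length in domain 2.
    """
    profile = {}
    for b in range(min_domain, length - min_domain + 1):
        crossing = sum(1 for i, j in contacts if min(i, j) <= b < max(i, j))
        profile[b] = 1000.0 * crossing / (b * (length - b))
    return profile


def smooth(profile, window=9):
    """Running mean centred on each position (window truncated at the ends)."""
    half = window // 2
    smoothed = {}
    for b in profile:
        vals = [profile[p] for p in range(b - half, b + half + 1) if p in profile]
        smoothed[b] = sum(vals) / len(vals)
    return smoothed


def local_extrema(profile):
    """Return (minima, maxima) positions; end points are never extrema."""
    pos = sorted(profile)
    vals = [profile[p] for p in pos]
    minima, maxima = [], []
    k = 1
    while k < len(vals) - 1:
        end = k
        while end + 1 < len(vals) and vals[end + 1] == vals[k]:
            end += 1  # extend over a flat run of equal values
        if end < len(vals) - 1:
            left, right = vals[k - 1], vals[end + 1]
            centre = pos[(k + end) // 2]
            if left > vals[k] and right > vals[k]:
                minima.append(centre)
            elif left < vals[k] and right < vals[k]:
                maxima.append(centre)
        k = end + 1
    return minima, maxima


def ranked_minima(profile, top=5):
    """Up to `top` (position, value, depth) tuples, lowest value first (LM1 first)."""
    if not profile:
        return []
    minima, maxima = local_extrema(profile)
    first, last = min(profile), max(profile)
    ranked = []
    for m in minima:
        # Nearest flanking maxima; fall back to the profile ends (assumption).
        left = max((x for x in maxima if x < m), default=first)
        right = min((x for x in maxima if x > m), default=last)
        depth = (profile[left] + profile[right]) / 2.0 - profile[m]
        ranked.append((m, profile[m], depth))
    ranked.sort(key=lambda t: t[1])
    return ranked[:top]
```

An empty list returned by `ranked_minima` corresponds to what the text calls a non-prediction, i.e. a profile lacking local minima.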
Various types of data sets were tested based either on the default PREDICT output of 19 ± 6 (mean ± SD) predictions or on a longer list of predictions including those of lower assigned confidence values (48 ± 19). To these two bases were applied two kinds of cut-offs: by assigned confidence value (0.8, 0.7, 0.6, 0.5 and 0.4) or as a percentage of the more confident predictions (85, 70, 55 and 40). The profiling algorithm was tested either making use of the assigned prediction confidence values to weight the points on the residue contact map, or simply assigning each prediction the same weight.
Repeated predictions were made for entire data sets while systematically altering various parameters. The most important of these were the presumed minimum domain size (in the range 20–45 residues) and the size of the smoothing window (5–19 residues). For each set of parameters summary results tables were produced recording various performance indicators: the mean distance between LM1 and the true structural domain boundary, how many times LM1 was the nearest minimum to the true boundary, and the number of times LM1, LM2 or LM3 were located within 15 residues of the actual domain division. The number of non-predictions obtained, corresponding to profiles lacking local minima, was also recorded, as were mean distances of LM1 from the true domain boundary, both for all predictions and for the subset for which LM1 was the nearest of the LMs.
In order to better estimate the significance of the results, 10 randomized data sets were generated by simply replacing the two residues predicted to contact with two randomly chosen from the length of the protein. Using exactly the same methodology, profile analysis was carried out for each, and performance indicators were recorded and averaged over the 10 data sets. In order to determine the improvement over random predictions made with the real data, the performance indicators obtained for real data were divided by the averaged randomized figures. For example, for better than random predictions this factor will be >1 in the case of the number of close hits recorded but <1 in the case of error measurements.
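The randomization control lends itself to a short sketch. The Python below is illustrative only: the function names are hypothetical, and drawing two distinct residues per pair is an assumption, since the text says only that the two residues were randomly chosen from the length of the protein.

```python
import random


def randomize_contacts(contacts, length, rng):
    """Replace every predicted pair with two residues drawn at random from 1..length.

    The two residues are drawn without replacement (an assumption).
    """
    return [tuple(sorted(rng.sample(range(1, length + 1), 2))) for _ in contacts]


def real_over_random(real_value, random_values):
    """Real/random factor: the real indicator divided by the mean of the randomized runs."""
    mean_random = sum(random_values) / len(random_values)
    if mean_random == 0:
        return None  # factor undefined when the randomized runs average zero
    return real_value / mean_random
```

For a count of close hits a factor above 1 indicates better than random performance, whereas for an error measure the favourable direction is a factor below 1.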
In order to determine the reasonable expected accuracy limits of the method, simulated sets of predicted contact data were randomly generated. The overall size of the hypothetical sequence and the ratio of the sizes of its two domains were varied. For each combination, 400 random data sets were generated, each containing 18–20 simulated predictions, corresponding roughly to the mean number of predictions obtained for real data (Table I). Various degrees of inter-domain region depletion were tested from 0.1 (inter-domain contact density is one tenth that of the intra-domain regions) to 1.0 (inter- and intra-domain regions have the same contact density). Analysis of these data was carried out using parameters later found to be optimum for real data (minimum domain size of 40 residues, smoothing window of nine residues).
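One way to generate simulated contact lists with a chosen inter-domain depletion is rejection sampling, sketched below in Python. This is an assumed procedure for illustration, since the text does not say how the depleted sets were drawn, and the function name is hypothetical. Pairs are drawn uniformly, and a pair that crosses the domain boundary is kept only with probability equal to the depletion factor, so the expected inter-domain density is that fraction of the intra-domain density.

```python
import random


def simulate_contacts(len1, len2, n_contacts, depletion, rng):
    """Random contact list for a hypothetical two-domain protein.

    len1, len2 : sizes of the N- and C-terminal domains
    depletion  : ratio of inter- to intra-domain contact density
                 (1.0 means no depletion; 0.0 is accepted and yields no inter-domain pairs)
    """
    length = len1 + len2
    contacts = []
    while len(contacts) < n_contacts:
        i, j = sorted(rng.sample(range(1, length + 1), 2))
        crosses_boundary = i <= len1 < j
        if crosses_boundary and rng.random() >= depletion:
            continue  # reject this inter-domain pair
        contacts.append((i, j))
    return contacts
```

For example, `simulate_contacts(100, 300, 19, 0.1, random.Random(1))` would give one data set for the 100/300 hypothetical protein at a depletion factor of 0.1.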
Threading experiments were carried out using the hybrid methods of Fischer (Fischer, 2000) at the Bioinbgu server (http://www.cs.bgu.ac.il/~bioinbgu), using the 3D-PSSM program (Kelley et al., 2000; http://www.bmm.icnet.uk/~3dpssm) and using GenTHREADER (Jones, 1999; http://insulin.brunel.ac.uk/psipred). All PCD profile calculations and analyses were carried out using programs written in YABasic (http://www.yabasic.de) on PCs.
Results and discussion
The premise of structural domain prediction using covariance analysis is illustrated in Figure 1. Since domains are, by definition, more compact than multi-domain proteins, in a residue contact map constructed from structural data, the inter-domain region(s), shaded in Figure 1, will be more sparsely populated than the intra-domain regions. Given sufficiently accurate predictions from covariance analysis, a predicted contact map should also possess a depleted zone corresponding to the inter-domain region. In the simplest case of proteins consisting of two sequential structural domains (i.e. the polypeptide chain folds first one domain and then the other), it then only remains to search for a single domain boundary. This can readily be carried out by constructing a profile containing the inter-domain PCD resulting from envisioning each possible domain boundary. In favourable cases the true domain boundary would be located near a local minimum of the profile, ideally that with the lowest profile value. (Note that in Figure 1, and throughout this article, profile values are multiplied by 1000 for convenience.) For some applications, such as fold recognition, several putative domain boundaries could be tested experimentally, so that location of the true domain boundary within 15 residues of one of the top three local minima (LM1–3) would still signify a useful result. Using these two criteria of success, the potential of this kind of analysis was measured using simulated data.
In order to examine the feasibility of the approach outlined above, simulated sets of predicted contact data were randomly generated (see Materials and methods). Figure 2a shows the number of cases for each parameter set in which the local minimum with the lowest profile value gave an accurate prediction (within 15 residues of the actual domain boundary) and Figure 2b the number of cases in which none of the three local minima with the lowest profile values corresponded to the domain boundary. The results show that under the most favourable conditions of maximal depletion of the inter-domain predicted contact zone, accurate identification of domain boundaries by profile analysis is feasible. In the most favourable case (the hypothetical protein with two domains of 200 residues each), LM1 corresponds well to the domain boundary in 368/400 cases whereas the boundary lies more than 15 residues from all of LM1–3 in just 29 cases. For the hardest hypothetical protein (with domains of 100 and 300 residues), LM1 indicates the domain boundary correctly in 290 cases and none of LM1–3 marks the boundary in 56 cases. In all cases, the highly successful predictions decrease in number (Figure 2a) and the failures increase (Figure 2b) as the inter-domain contact density approaches that of the intra-domain regions. It is also clear that predictions for small proteins are more successful than those for large proteins, presumably since the number of possible domain boundaries is smaller for the former. These results are expected, but the behaviour of different hypothetical proteins differs in surprising ways. For example, from similar success rates at an inter-domain depletion factor of 0.1 (316 and 365 for the 130/270 and 50/150 proteins, respectively), the success rate for the 130/270 domain hypothetical protein drops to just 138 at a depletion factor of 0.4 whereas for the 50/150 protein success is maintained in 342 cases.
A data set of known two-domain proteins
In order to apply the contact density profile method for domain boundary prediction to real proteins, a data set of multiple sequence alignments was assembled for families of proteins known to have exactly two sequential domains, and covariance analysis carried out (see Materials and methods). These proteins represent the simplest cases addressable by the PCD method. Failure with these would imply that more complicated domain arrangements, requiring more complicated algorithms, would be intractable to PCD analysis. The HMM methods used to assemble the alignments are particularly suitable for this work since the alignments are of good overall quality (Karplus et al., 1999) and the technique is capable of effectively identifying distant homologues (Park et al., 1998), whose presence is known to improve the quality of the results of this type of analysis (Olmea and Valencia, 1997). Some fundamental characteristics of the data set are shown in Table I. The members of the test data set are highly heterogeneous, with the founder members sharing no more than 25% pairwise sequence identity. The size of the founder members varies widely from 116 to 601 residues with a mean of 257 residues and a SD of 121 residues. Similarly, domain sizes vary from 29 to 354 residues, mean ± SD of 122 ± 71 residues. Both the number of sequences and their mean pairwise sequence identity within each alignment vary widely. Alignments with fewer than five or more than 650 sequences were discarded since in the former case predicted contacts from covariance analysis would be of very poor quality, and in the latter case computational demands would be exorbitant. Of the retained alignments, membership ranges from five to 609 sequences. Sequence variation, expressed as mean pairwise percentage sequence identity between alignment members, ranges from 18.9 to 95.9, mean ± SD of 47.6 ± 17.4. Analysis of these alignments with the PREDICT program for correlated mutation analysis yields between nine and 36 contact predictions, mean ± SD of 19 ± 6.0. The depletion factor of the inter-domain predicted contact region relative to the PCD of the whole contact map was calculated for each contact prediction set using the known domain boundaries. The results varied from 0 to 2.08, mean ± SD of 0.87 ± 0.50, with numbers above 1.0 indicating that the inter-domain predicted contact region is more densely populated than the contact map as a whole. Depletion factors above 1.0 reflect inaccurate contact predictions and could not exist for maps of true contacts. The mean depletion factor of 0.85 indicates only a very modest average depletion, considering the importance of this factor, as shown using simulated data (Figure 2). It is also worth noting that depletion factors of 0 (Table I) were often the result of few contact predictions being made, or the concentration of the predictions just in one domain, rather than the ideal style of distribution shown in Figure 1.
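The depletion factor discussed above can be written down compactly. The following Python sketch is an illustration with a hypothetical function name. It assumes that the area of the whole contact map is the number of distinct residue pairs and that the inter-domain area is the product of the two domain lengths, since the exact area convention is not spelled out in the text.

```python
def depletion_factor(contacts, length, boundary):
    """Inter-domain predicted contact density relative to that of the whole map.

    boundary : last residue of the N-terminal domain (known from the structure)
    Returns None when there are no predicted contacts.
    """
    if not contacts:
        return None
    n_inter = sum(1 for i, j in contacts if min(i, j) <= boundary < max(i, j))
    area_inter = boundary * (length - boundary)   # pairs spanning the two domains
    area_total = length * (length - 1) / 2.0      # all distinct residue pairs
    return (n_inter / area_inter) / (len(contacts) / area_total)
```

A value of 0 means that no predicted contact crosses the boundary, and a value above 1.0 means that the inter-domain region is more densely populated than the map as a whole.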
Results for two-domain proteins and statistical analysis
For each choice of data set, weighting and calculation type (see Materials and methods), predictions were made using various smoothing window sizes and assumed minimum domain sizes. Adoption of a smoothing window was necessary in order that the many small local minima resulting from the sparse predicted contact data were ignored in favour of the larger local minima. The use of an assumed minimum domain size helped reduce the problem of the ragged ends present in most alignments. Since the covariance calculation ignores positions with more than 10% gaps, these ragged ends would lead to a lack of predictions for the terminal regions, and hence erroneous areas of low PCD in the resultant profiles.
Several results were monitored, including the number of times LM1 was the closest local minimum to the actual domain boundary, the mean number of residues between LM1 and the domain boundary, this same mean for the subset of cases when LM1 was the closest LM to the domain boundary, and the number of correct predictions where LM1 lay within 15 residues of the domain boundary. In order to assign statistical significance to these results, they were compared with mean values for corresponding randomized data sets (see Materials and methods).
Analysis showed that inclusion of additional lower confidence predictions lowered prediction accuracy, monitored as above (data not shown); the default PREDICT output performed as well as any cut-off data set. Apparently the additional points on the PCD map, which might be expected to enhance the occasionally sparse data distribution, are not sufficiently accurate to justify inclusion. Other experiments showed that using the assigned confidence values in the default PREDICT output had negligible effect on the accuracy of the predictions (data not shown), perhaps because they generally have similar high values (0.75 ± 0.09 for all default predictions). Therefore, equally weighted default PREDICT results were used exclusively for further analysis.
Table II shows the performance of the PCD method applied to the real default data sets compared with mean values from predictions made, using identical methodology, for 10 randomized contact lists. All values are means taken from 35 analyses using smoothing windows of 7, 9, 11, 13, 15, 17 or 19 residues in combination with assumed minimum domain sizes of 25, 30, 35, 40 or 45 residues. Entries in the table represent real/random factors so that values <1 signify statistically significant improvements in the LM1 distance from boundary columns whereas values >1 imply better than random performance in the remaining columns. Table II clearly shows that the predictions made by PCD profiling are statistically better than random, albeit modestly so. For example, LM1 is the closest LM to the true domain boundary in up to twice as many cases as calculated from randomized data. It is also notable that the number of LM1 close hits reaches a value more than double that expected by chance. Further inspection of the results revealed a heterogeneous mixture of remarkably accurate predictions with others of varying inaccuracy. Therefore, a search was made for a way of identifying the successful predictions.
A correlation was noted between a low number of profile minima and low distances between LM1 and the true domain boundary (data not shown). For profiles containing a single local minimum, LM1 errors were in the range 2–7. When two local minima were present, the maximum LM1 error was 26, and the trend seems to continue for larger numbers of minima. The depth of each individual local minimum also correlated with its distance from the actual domain boundary. When the error is plotted against local minimum depth for up to five lowest local minima, large depths are associated solely with more accurate predictions (data not shown). Low depth local minima are associated with errors of all sizes. When the depth of each local minimum is combined with the number of minima of the profile from which it came, the tail of the new graph is even more marked; predictions characterized by large depth/number of local minima values are even better associated with low local minimum-domain boundary errors (data not shown). Using this LM1 depth/number of local minima formulation, parameters were re-examined in order to see which combination enabled the identification of the largest number of accurate predictions.
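The LM1 depth/number of local minima criterion amounts to a simple ratio followed by a cut-off. The Python below is a minimal sketch with hypothetical names. The cut-off is left as a parameter because suitable values depend on the parameter set in use.

```python
def reliability_score(lm1_depth, n_minima):
    """LM1 depth divided by the number of local minima in the profile.

    Returns None for a non-prediction (a profile lacking local minima).
    """
    if n_minima == 0:
        return None
    return lm1_depth / n_minima


def select_reliable(predictions, cutoff):
    """Keep predictions whose score reaches the cut-off.

    predictions : iterable of (name, lm1_depth, n_minima) tuples
    """
    kept = []
    for name, depth, n_minima in predictions:
        score = reliability_score(depth, n_minima)
        if score is not None and score >= cutoff:
            kept.append((name, score))
    return kept
```

Dividing by the number of minima penalizes ragged profiles, in line with the observation that profiles with few minima tend to give smaller LM1 errors.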
Overall, the most effective parameter set for the identification of the best predictions comprises an assumed minimum domain size of 40 residues and a smoothing window of nine residues. Figure 3 illustrates the effect of varying these ideal parameters (Figures 3a and b), and also a comparison of the LM1 depth/number of local minima criterion with the simpler LM1 depth measure (Figure 3c). In Figure 3, a line is drawn for each different set of parameters, resulting from the application of different LM1 depth/number of local minima cut-offs in the range 0.6–4.0, each leading to a certain number of predictions below the cut-off with a certain mean error. LM1 depth cut-offs in the range 0.15–0.6 were used.
A final prediction of domain boundaries for the test data set was made and analysed to determine factors associated with successful and unsuccessful predictions. Experiments with simulated data (Figure 2) suggested that large proteins of similar domain size and low inter-domain depletion factor should be the most difficult cases. The dependence of the domain boundary prediction on covariance analysis suggests that factors associated with accurate contact predictions, namely sequence diversity and number of sequences in the alignment (Olmea and Valencia, 1997), should also have a positive influence on domain boundary prediction accuracy.
Figure 4 shows the relationship of prediction error to these characteristics. Indeed, domain boundaries in larger proteins are in general predicted less well (Figure 4a). Linear regression gives a weak but significant correlation coefficient of 0.49. However, it is worth noting that accurate predictions are present for some large proteins. Surprisingly, there seems to be no relationship between prediction accuracy and the ratio of domain sizes (data not shown), with the two least accurate predictions made for proteins with domains of unequal sizes. However, the number of predictions for proteins of unequal domain size may be too small to draw reliable conclusions. Only a very weak relationship between actual inter-domain depletion and LM1 prediction error was evident (Figure 4b). However, examination of just those cases where LM1 indicated the domain boundary with an error of less than 15 residues highlights the importance of inter-domain PCD depletion. When all the predictions are ranked by inter-domain PCD depletion and divided into two groups, just four accurate predictions are found in the cases with least depletion whereas 13 accurate predictions are made where inter-domain PCD is more depleted. Surprisingly, only weak correlations between prediction error and sequence variability (mean pairwise percent sequence identity within the alignments; Figure 4c) and between prediction error and the number of sequences in the alignment (Figure 4d) were observed. Alignments with fewer than 15 sequences are routinely thought to be inadequate for the prediction of residue contacts (Olmea and Valencia, 1997). To a certain extent our results confirm this trend since, of the eight alignments used that contain fewer than 15 sequences, four lead to non-predictions. However, two others, containing seven and 11 sequences, lead to predictions with errors of just 13 and 14 residues. Hence, useful information may be derived from some alignments of rather few sequences.
Examples of successful and unsuccessful predictions are shown in Figures 5 and 6, respectively. As the case of Pseudomonas 2,3-dihydroxybiphenyl 1,2-dioxygenase (1DHY0) shows, a high degree of inter-domain contact depletion is not essential for an accurate prediction. The example of human salivary α-amylase (1SMD0) shows that accurate predictions can be made for the more difficult cases (Figure 2) of larger proteins. Unsuccessful predictions can be divided into two categories: the non-predictions such as pertussis toxin (1PRTB), and the inaccurate predictions such as catabolite gene activator protein (2CGPA). In these cases the blame presumably lies with the intrinsically limited accuracy of the contact predictions (Olmea and Valencia, 1997), exacerbated in many of the cases of non-predictions by the limited number of sequences available.
In order to assess the performance of the PCD profile method on single-domain proteins, two further data sets were generated, the first of one-domain proteins of the most typical length (141–162 residues) and the second of larger one-domain proteins with lengths in the range 169–567 residues (see Materials and methods). The results of the PCD profile analysis immediately showed that, whereas effective in highlighting more accurate predictions for two-domain proteins, the LM1 depth and LM1 depth/number of local minima characteristics are not capable of discriminating against false predictions made for one-domain proteins; predictions were made in both data sets at levels exceeding the cut-offs shown in Table III.
Comparison of false predictions for one-domain proteins and correct predictions for two domains revealed one characteristic with some discriminatory capability. For correct two-domain protein predictions above the LM1 depth/number of local minima value of 0.1, most (six out of nine) assignments lead to divisions of the contact map with both intra-domain regions populated. In contrast, only one of the six false predictions for the typical length one-domain data set fulfilled this criterion. For the larger single-domain protein data set, the single prediction above this cut-off did not lead to two populated intra-domain regions of the predicted contact map.
Therefore, it seems that applying the 'both intra-domain regions populated' rule in combination with the LM1 depth/number of local minima cut-off enables many false predictions made for single-domain proteins to be discounted. Nevertheless, the false predictions remain a problem, particularly for smaller proteins. This is an important consideration when structural studies are projected, but is less significant for threading studies which generally require little time.
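The 'both intra-domain regions populated' rule can be expressed as a simple check on the predicted contact list. The Python below is a sketch with a hypothetical function name. Treating a region as populated when it holds at least one predicted contact is an assumption, as no threshold is given in the text.

```python
def both_domains_populated(contacts, boundary, min_contacts=1):
    """True if each intra-domain region of the predicted contact map is populated.

    boundary     : predicted last residue of the N-terminal domain
    min_contacts : contacts required per region (assumed to be 1)
    """
    in_first = sum(1 for i, j in contacts if max(i, j) <= boundary)
    in_second = sum(1 for i, j in contacts if min(i, j) > boundary)
    return in_first >= min_contacts and in_second >= min_contacts
```

A prediction that passes a reliability cut-off but fails this check would be treated as a likely false domain assignment.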
Application of the technique to CASP3 targets
Four of the targets of the CASP3 blind structure prediction contest (Moult et al., 1999) consisted of two domains: target IDs T0044, T0063, T0071 and T0083, now corresponding, respectively, to PDB chains 1QMHA, 1BKB0, 1B9KA and 1DW9A. Setting aside 1QMHA, in which one domain is inserted into the other, thereby complicating analysis (Russell and Ponting, 1998), PCD profiles were calculated for the remaining three chains to see if predicted domain boundary definitions would have helped fold assignment. For 1BKB0, 1B9KA and 1DW9A, respectively, the alignments contained 71, 15 and 10 sequences, sharing 33–47% mean sequence identity, leading to 10, 13 and 19 predicted contacts.
For 1BKB0 and 1B9KA, predictions were made with characteristics suggesting high reliability. The LM1 depth/number of local minima values for these profiles were 0.49 and 0.68, respectively. Comparison of the predicted and actual domain boundaries revealed these predictions to be correct to within nine and three residues, respectively. In contrast, the LM1 depth/number of local minima value for 1DW9A was just 0.02, not indicative of a reliable result, and indeed the prediction was incorrect. The case of 1DW9A may not have been helped by the fact that the C-terminal domain does not form a compact structure, instead intertwining with corresponding domains in symmetry-related subunits (Walsh et al., 2000).
Encouraged by the two strong predictions, threading experiments were carried out to compare results for the sequences of entire chains, predicted domains and actual domains. The results of 3D-PSSM analysis (Kelley et al., 2000) are summarized in Table IV. Correct results for the N- and C-terminal domains of 1BKB0 were SH3-like folds and OB folds, respectively, whereas for 1B9KA the N- and C-terminal domains resemble immunoglobulin folds and TATA-box-binding protein structural repeats, respectively (Murzin, 1999).
Application of the technique to geminivirus AL1 protein
AL1 (also known as Rep), possessing approximately 260 residues, is the only protein required for replication of all geminiviruses (Elmer et al., 1988) and possesses multiple biochemical activities including DNA binding (Fontes et al., 1992). A series of experiments has culminated in the identification of the AL1 origin DNA-binding site and cleavage domain within residues 1–116 and 1–120, respectively (Gladfelter et al., 1997; Orozco et al., 1997).
An alignment of AL1 protein sequences was constructed from the parent sequence of bean golden mosaic virus (Gilbertson et al., 1991) using the same methods as applied to the test data set. It contained 117 sequences sharing a mean pairwise percentage sequence identity of 66 ± 15. Using the optimized parameter set, a PCD profile was constructed and analysed. LM1 of this profile lay at residue 132 and had depth and depth/number of local minima characteristics of 0.18 and 0.03, respectively. According to Table III, the depth/number of local minima value is not indicative of a reliable prediction, but the depth of LM1 corresponds to an average error of approximately 19 residues. In addition, the domain definition agrees very well with the functionally defined AL1 origin DNA-binding site domain spanning residues 1–116 (Gladfelter et al., 1997).
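The mean pairwise percentage identity quoted for the alignment can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the text: identity is counted only over columns where neither sequence has a gap, the spread is taken as a population standard deviation, and the three toy sequences are invented for illustration and are not AL1 sequences.

```python
from itertools import combinations
from statistics import mean, pstdev


def percent_identity(a, b, gap="-"):
    """Percent identity over aligned columns where neither sequence is gapped.

    Ignoring gapped columns is an assumption of this sketch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def pairwise_identity_stats(alignment):
    """Return (mean, population SD) of percent identity over all sequence pairs."""
    values = [percent_identity(a, b) for a, b in combinations(alignment, 2)]
    return mean(values), pstdev(values)


# Hypothetical toy alignment, for illustration only.
toy_alignment = ["MKV-LT", "MKVALT", "MRVAL-"]
mean_id, sd_id = pairwise_identity_stats(toy_alignment)
print(f"{mean_id:.1f} +/- {sd_id:.1f}")
```

A different gap convention (for example, counting gapped columns as mismatches) would lower the values, so the convention should be fixed before comparing alignments.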
The two putative domains of the bean golden mosaic virus AL1 sequence were then subjected to threading experiments. The most significant results were obtained using the methods of Fischer (Fischer, 2000) which suggested a structural correspondence between the first AL1 domain and the C-terminal single-stranded DNA-binding domain of topoisomerase (1YUA; Yu et al., 1995) which has a length of 122 residues. This domain belongs to the same SCOP superfamily (zinc β-ribbon) as the single-stranded DNA-binding domains of DNA primases (1PFT; Pan and Wigley, 2000) and transcriptional elongation factors (1TFI; Qian et al., 1993). Therefore, the threading result matches well with the AL1 domain 1 sequence in terms of length (122 versus 116) and biochemical activity (single-stranded DNA binding).
The dependence of the threading result on sequence length is shown in Figure 7. As the length of the AL1 domain sequence supplied deviates from the topoisomerase domain length, the threading score drops rapidly, particularly in the direction of shorter sequences. When the supplied sequence is 40 residues too short or too long, the topoisomerase domain is no longer the highest scoring fold. These results confirm the sensitivity of threading to sequence length. In this case the PCD profiling method gave a domain size 10 residues larger than that producing the best threading results. However, even with this error the threading score was 96% of the best achievable.
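A length scan of this kind can be set up as in the following minimal Python sketch, which only generates the N-terminal fragments to be threaded; submission to a threading server is not shown. The function name `length_scan`, the 10-residue step and the placeholder poly-alanine sequence are assumptions made for illustration. The boundary of 132 and the overall length of about 260 residues are taken from the text.

```python
def length_scan(sequence, boundary, max_offset=40, step=10):
    """Yield (offset, fragment) pairs for N-terminal fragments.

    Each fragment runs from residue 1 to `boundary + offset`, so a negative
    offset gives a shorter fragment and a positive one a longer fragment.
    Offsets that fall outside the sequence are skipped.
    """
    for offset in range(-max_offset, max_offset + 1, step):
        end = boundary + offset
        if 1 <= end <= len(sequence):
            yield offset, sequence[:end]


# Placeholder sequence of 260 residues; NOT the real AL1 sequence.
placeholder = "A" * 260
fragments = list(length_scan(placeholder, boundary=132))
for offset, fragment in fragments:
    print(offset, len(fragment))
```

Each fragment would then be scored independently, and the scores plotted against fragment length to give a curve of the type described for Figure 7.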
With the advent of genome sequencing projects, the number of protein sequences in the databases is growing exponentially. With this deluge of sequence data comes the challenge of adequately annotating new sequences. Although new homology-independent methods are arriving (Marcotte, 2000), the bulk of current sequence annotation is based on identification of homology of new sequences with those already characterized. This enables the assignment of characteristics for the new sequence with a degree of accuracy dependent on the degree of sequence similarity (Devos and Valencia, 2000). New sequence analysis techniques will also contribute to improved functional annotation (Gallet et al., 2000; Hannenhalli and Russell, 2000). The sensitivity of sequence comparison techniques is continually improving (Altschul et al., 1997; Karplus et al., 1998) but there remain cases where sequence divergence has occurred to such an extent that truly homologous proteins are no longer detectable by sequence comparisons (Rost, 1999). Threading methods help significantly in these cases (Fischer and Eisenberg, 1997) but suffer from their sensitivity to the length of supplied sequence; correct matches to a known domain structure may not be obtained if the length of sequence supplied differs markedly from the size of the domain (Fischer et al., 1999; Table IV and Figure 7). By supplying a list of possible domain boundaries and means to judge their reliability (Table III), the PCD profile methodology outlined in this article should help in these cases.
X-ray crystallographic and NMR studies of isolated protein domains offer a route for structural analysis of proteins whose size, flexibility or other characteristics render whole-protein analysis impossible. Accurate domain boundary knowledge is crucial in these cases to avoid the presence of unstructured tails or domain destabilization through removal of structurally important regions. Whereas the overall success rate of the PCD profile method (35% of top predictions within 15 residues) is low, a subset of accurate predictions can be identified. The top six, scoring 2.8 or more by the LM1 depth/number of local minima criterion (Table III), have errors from CATH domain definitions of 0–7 residues. Therefore, they correspond to essentially correct assignments, especially recalling the different results generated by different domain assignment programs for the same supplied structure (Holm and Sander, 1994; Sidduqui and Barton, 1995; Swindells, 1995). These predictions could safely have been used as the basis for structural studies (Figure 8).
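The two summary statistics used above, the fraction of top predictions falling within a tolerance of the reference boundary and the subset passing a reliability cut-off, can be sketched in Python as follows. The `Prediction` record, the function names and all four toy entries are hypothetical and do not reproduce values from Table III; only the 15-residue tolerance and the 2.8 cut-off are taken from the text.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    name: str         # hypothetical identifier
    predicted: int    # predicted boundary residue
    reference: int    # reference (e.g. CATH) boundary residue
    score: float      # LM1 depth / number of local minima


def boundary_error(p):
    """Absolute distance in residues between predicted and reference boundary."""
    return abs(p.predicted - p.reference)


def success_rate(predictions, tolerance=15):
    """Fraction of predictions within `tolerance` residues of the reference."""
    hits = sum(boundary_error(p) <= tolerance for p in predictions)
    return hits / len(predictions)


def reliable_subset(predictions, min_score):
    """Predictions whose reliability score reaches the chosen cut-off."""
    return [p for p in predictions if p.score >= min_score]


# Invented toy data, for illustration only.
toy = [
    Prediction("A", 120, 117, 3.1),
    Prediction("B", 80, 140, 0.2),
    Prediction("C", 200, 190, 2.9),
    Prediction("D", 50, 95, 0.1),
]
rate = success_rate(toy, tolerance=15)
selected = reliable_subset(toy, min_score=2.8)
print(rate, [p.name for p in selected])
```

Keeping the cut-off as an explicit argument makes it straightforward to examine how the size and accuracy of the selected subset change with the threshold.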
References
Bantscheff,M., Weiss,V. and Glocker,M.O. (1999) Biochemistry, 38, 11012–11020.
Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L.L. (2000) Nucleic Acids Res., 28, 263–266.
Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) Nucleic Acids Res., 28, 235–242.
Bocquier,A.A., Potts,J.R., Pickford,A.R. and Campbell,I.D. (1999) Structure Fold Des., 7, 1451–1460.
Bu,W.S., Feng,Z.P., Zhang,Z. and Zhang,C.T. (1999) Eur. J. Biochem., 266, 1043–1049.
Chan,C.L., Lonetto,M.A. and Gross,C.A. (1996) Structure, 4, 1235–1238.
Choulier,L., Lafont,V., Hugo,N. and Altschuh,D. (2000) Proteins, 41, 475–484.
Cohen,S.L. (1996) Structure, 4, 1013–1016.
Corpet,F., Servant,F., Gouzy,J. and Kahn,D. (2000) Nucleic Acids Res., 28, 267–269.
Devos,D. and Valencia,A. (2000) Proteins, 41, 98–107.
Elmer,J.S., Brand,L., Sunter,G., Gardiner,W.E., Bisaro,B.M. and Rogers,S.G. (1988) Nucleic Acids Res., 16, 7043–7060.
Fischer,D. (2000) Pacific Symp. Biocomputing. Hawaii, pp. 119–130.
Fischer,D. and Eisenberg,D. (1997) Proc. Natl Acad. Sci. USA, 94, 11929–11934.
Fischer,D., Barret,C., Bryson,K., Elofsson,A., Godzik,A., Jones,D., Karplus,K.J., Kelley,K.A., Maccallum,R.M., Pawłowski,K. et al. (1999) Proteins, (Suppl. 3), 209–217.
Fontes,E.P.B., Luckow,V.A. and Hanley-Bowdoin,L. (1992) Plant Cell, 4, 597–608.
Gallet,X., Charloteaux,B., Thomas,A. and Brasseur,R. (2000) J. Mol. Biol., 302, 917–926.
Gilbertson,R.L., Hidayat,S.H., Martinez,R.T., Leong,S.A., Faria,J.C., Morales,F.J. and Maxwell,D.P. (1991) Plant Dis., 75, 336–342.
Gladfelter,H.J., Eagle,P.A., Fontes,E.P.B., Batts,L. and Hanley-Bowdoin,L. (1997) Virology, 239, 186–197.
Gobel,U., Sander,C., Schneider,R. and Valencia,A. (1994) Proteins, 18, 309–317.
Gracy,J. and Argos,P. (1998a) Trends Biochem. Sci., 23, 497–497.
Gracy,J. and Argos,P. (1998b) Bioinformatics, 14, 174–187.
Hannenhalli,S.S. and Russell,R.B. (2000) J. Mol. Biol., 303, 61–76.
Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Protein Sci., 1, 409–417.
Holm,L. and Sander,C. (1994) Proteins, 19, 256–268.
Jones,D.T. (1999) J. Mol. Biol., 287, 797–815.
Karplus,K., Barrett,C. and Hughey,R. (1998) Bioinformatics, 14, 846–856.
Karplus,K., Barrett,C., Cline,M., Diekhans,M., Grate,L. and Hughey,R. (1999) Proteins, (Suppl. 3), 121–125.
Kelley,L.A., MacCallum,R.M. and Sternberg,M.J.E. (2000) J. Mol. Biol., 299, 501–522.
Kraulis,J. (1991) J. Appl. Crystallogr., 24, 946–950.
Kuroda,Y., Tani,K., Matsuo,Y. and Yokoyama,S. (2000) Protein Sci., 9, 2313–2321.
Larsen,T.M., Laughlin,L.T., Holden,H.M., Rayment,I. and Reed,G.H. (1994) Biochemistry, 33, 6301–6309.
Larson,S.M., DiNardo,A.A. and Davidson,A.R. (2000) J. Mol. Biol., 303, 433–446.
Lesk,A.M. and Chothia,C. (1980) J. Mol. Biol., 136, 225–270.
Marcotte,E.M. (2000) Curr. Opin. Struct. Biol., 10, 359–365.
Matthews,B.W. (1997) Methods Enzymol., 276, 3–10.
Moult,J., Hubbard,T., Fidelis,K. and Pedersen,J.T. (1999) Proteins, (Suppl. 3), 2–6.
Murzin,A.G. (1999) Proteins, (Suppl. 3), 88–103.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Olmea,O. and Valencia,A. (1997) Fold. Des., 2, S25–S32.
Olmea,O., Rost,B. and Valencia,A. (1999) J. Mol. Biol., 295, 1221–1239.
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5, 1093–1108.
Orengo,C.A., Bray,J.E., Hubbard,T., LoConte,L. and Sillitoe,I. (1999) Proteins, 37, 149–170.
Orozco,B.M., Miller,A.B., Settlage,S.B. and Hanley-Bowdoin,L. (1997) J. Biol. Chem., 272, 9840–9846.
Ortiz,A.R., Kolinski,A., Rotkiewicz,P., Ilkowski,B. and Skolnick,J. (1999) Proteins, 37, 177–185.
Owen,D.J., Papageorgiou,A.C., Garman,E.F., Noble,M.E. and Johnson,L.N. (1995) J. Mol. Biol., 246, 374–381.
Pan,H. and Wigley,D.B. (2000) Structure Fold Des., 8, 231–239.
Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) J. Mol. Biol., 284, 1201–1210.
Pazos,F., Helmer-Citterich,M., Ausiello,G. and Valencia,A. (1997) J. Mol. Biol., 272, 1–13.
Potts,J.R., Bright,J.R., Bolton,D., Pickford,A.R. and Campbell,I.D. (1999) Biochemistry, 38, 8304–8312.
Qian,X., Gozani,S.N., Yoon,H., Jeon,C.J., Agarwal,K. and Weiss,M.A. (1993) Biochemistry, 32, 9944–9959.
Rossmann,M.G. and Argos,P. (1981) Annu. Rev. Biochem., 50, 497–532.
Rost,B. (1996) Methods Enzymol., 266, 525–539.
Rost,B. (1999) Protein Eng., 12, 85–94.
Rost,B. and Sander,C. (2000) 3rd generation prediction of secondary structure. In Webster,D.M. (ed.), Predicting Protein Structure: Methods and Protocols. Humana Press, pp. 71–95.
Russell,R.B. and Ponting,C.P. (1998) Curr. Opin. Struct. Biol., 8, 364–371.
Sali,A. and Blundell,T.L. (1993) J. Mol. Biol., 234, 779–815.
Schultz,J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000) Nucleic Acids Res., 28, 231–234.
Shindyalov,I.N., Kolchanov,N.A. and Sander,C. (1994) Protein Eng., 7, 349–358.
Sidduqui,A.S. and Barton,G.J. (1995) Protein Sci., 4, 872–884.
Sticht,H., Pickford,A.R., Potts,J.R. and Campbell,I.D. (1998) J. Mol. Biol., 276, 177–187.
Swindells,M.B. (1995) Protein Sci., 4, 103–112.
Taylor,W.R. and Hatrick,K. (1994) Protein Eng., 7, 341–348.
Walsh,M.A., Otwinowski,Z., Perrakis,A., Anderson,P.M. and Joachimiak,A. (2000) Structure Fold Des., 8, 505–514.
Wheelan,S.J., Marchler-Bauer,A. and Bryant,S.H. (2000) Bioinformatics, 16, 613–619.
Yu,L., Zhu,C.X., Tse-Dinh,Y.C. and Fesik,S.W. (1995) Biochemistry, 34, 7622–7628.
Received April 25, 2001; revised September 27, 2001; accepted November 1, 2001.