1 Laboratory of Biochemistry, Wageningen Agricultural University, Dreijenlaan 3, 6703 HA Wageningen, 2 Target Discovery Unit and 3 Department of Molecular Design and Informatics, NV Organon,P.O. Box 20, 5340 BH Oss, The Netherlands
![]() |
Abstract |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Keywords: automated homology modeling/Caenorhabditis elegans/fold recognition/nuclear receptors/sequence annotation
![]() |
Introduction |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Nuclear hormone receptor genes are commonly identified and annotated on the basis of the highly conserved, N-terminal DNA binding domain (DBD) (Clarke and Berg, 1998). Not every hormone receptor contains a DBD, however, and not every protein containing a C4-type Zinc-finger DBD also features a hormone-binding domain. In addition, reliable identification of the presence of a ligand-binding domain (LBD) is complicated by the relatively low sequence identities observed among LBDs. For instance, annotations for 233 nuclear receptors in Caenorhabditis elegans were obtained using hidden Markov models (HMMs) for the DBD (Clarke and Berg, 1998
). Only after applying hidden Markov models for the LBD, eventually incorporating also C.elegans sequences that scored well in previous searches, could an LBD fold tentatively be assigned to 90% of the predicted proteins. The remaining 10% of the predicted proteins with a DNA-binding domain remained without further annotation for a ligand-binding domain.
When sequence identity alone does not allow for the unambiguous annotation of a gene, a numerical estimate of the compatibility of a sequence with a three-dimensional (3D) protein fold provides an independent measure for annotating sequences. Several different approaches to obtaining such an estimate have been described and are used in fold recognition software (Torda, 1997). If a 3D model can be obtained for a protein sequence, model validation tools can be used to assess its suitability for a particular fold.
Homology or comparative modeling allows for the automatic generation of 3D structural models for protein sequences by using the structure(s) of sequence homologues as templates. Homology modeling can be expected to yield meaningful 3D models for sequencetemplate pairs with sequence identities over 30%. Sánchez and Sali (1998) followed such a strategy for gene annotation by generating homology models for all sequences in the Saccharomyces cerevisiae genome that shared sequence homology with structures in the protein databank (PDB).
As the basic fold of the nuclear hormone receptor LBD has been known since the elucidation of several structures within this family, homology modeling is an attractive technique to generate and evaluate models for sequences of putative receptors in C.elegans. Although the sequence identity among hormone-binding domains of nuclear receptors is often below 20%, the strong conservation of the adjacent N-terminal DNA binding domain allows for the localization of the putative hormone-binding domain and its alignment with other sequences.
Here we present results on automated homology modeling and model assessment for a database consisting of known and (partially) annotated nuclear receptor sequences from mammals and C.elegans, respectively. The mammalian sequences serve both as an internal control and to calibrate the criteria for annotation of non-mammalian sequences. We discuss the importance of using different LBD crystal structures for model building and their role in the annotation of putative hormone receptor sequences.
![]() |
Materials and methods |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Multiple sequence alignments were produced with ClustalW and its graphical interface ClustalX using default parameters. First, all sequences in the database were aligned in a multiple alignment. Sequences that caused the insertion of large gaps in the Zinc-finger domain region were removed. Sequences that did not contain Zinc-finger motifs were kept if they did not disrupt the alignments. The portion of aligned sequences C-terminal of the Zinc-finger domains was saved for further alignment.
Subsequently, a profile of annotated and aligned nuclear hormone receptors was created from human, Drosophila melanogaster and C.elegans nuclear receptor LBD sequences. This profile was used to align the LBD sequences in our database. Sequences that disturbed the N-terminal region of the LBD sequences, containing the strongly conserved F/WAK tripeptide sequence, were removed. The resulting alignment was trimmed to delete all amino acids before and after the ligand-binding domains as delimited by the crystal structures. Finally, a profile of the sequences of the four known structures was created and the remaining sequences (206) were individually aligned with this profile.
Nuclear hormone receptor crystal structures were chosen from those available at the initiation of this project in the PDB. For the estrogen receptor (ER), progesterone (PR), retinoic acid- (RAR) and retinoid X-
receptor (RXR) ligand-binding domains PDB entries 2ERD (Shiau et al., 1998
), 1A28 (Williams and Sigler, 1998
), 2LBD (Renaud et al., 1995
) and 1LBD (Bourguet et al., 1995
) were used, respectively. The pairwise sequence identities of these LBD sequences were in the range 1431%. Using different structures for the same protein as templates (which are, for instance, available for the ER) did not yield significant differences in Z-scores. Therefore, only one representative structure was used per receptor. All ligands and waters were removed prior to their application as templates for homology modeling. Models were built using Modeler 5.0 (Sali and Blundell, 1993
) as distributed with Quanta 98 (MSI, San Diego, CA) using default parameters and no further optimization. The `very_fast' and `Align2D' options in Modeler 5.0 were used with default parameters (Sanchez and Sali, 1998
). These options proved useful for speeding up model generation by a factor of 2 and utilize template-derived secondary structure information to improve the initial alignments, respectively. The Z-score as calculated by ProsaII version 3.0 (Sippl, 1993
) was used to assess the quality of the models. No correction was made for differences in sequence length, as these were small after trimming the sequences to the length of those of the crystal structures. In our hands, the ProsaII Z-score performed better in discriminating true hormone receptors from unrelated sequences in a comparison with the LüthyEisenberg 3D/1D profile (Lüthy et al., 1992
) and the packing quality assessment of WhatCheck (Vriend et al., 1993). All further analysis was performed with in-house developed software, Quanta 98 and Excel (Microsoft).
PSI-BLAST (Altschul et al., 1997) sequence alignment searches were performed for comparison using Blastp 2.0. Four Blast searches were performed using the four template sequences individually against the LBD database to which the template sequences had been added. This allows for the automatic construction of a profile based on the four template sequences by PSI-BLAST, similar to the use of a profile in aligning sequences for model generation. For each template sequence three iterations were performed and profiles were accumulated from all sequences with E-values smaller than 107. The final selection of sequences identified by PSI-BLAST to be similar to the template LBD were those with E-values smaller than 104. All sequences were subsequently collected and reduced to a list of unique sequences with reasonable similarity to the four templates.
![]() |
Results and discussion |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Figure 1A displays the cumulative percentage of the annotated portion of the database that obtained a certain Z-score, depending on which receptor structure was used as a template. In addition, the percentages of annotated sequences that obtained a certain mean Z-score, averaged over all four structures, and the lowest Z-score obtained with any of the four structural templates are plotted. A low Z-score for a model is assumed to allow for the reliable assignment of the LBD fold to a sequence (Sippl, 1993
). As shown in Figure 1A
, the four individual LBD structures assign this fold to a smaller percentage of the sequences, compared with the strategy of selecting the best score obtained over all four templates. Selecting the best Z-score is warranted, as sequences will produce better models when these are based on protein structures that are closer in terms of sequence identity or function. The mean Z-score curve is, as expected, intermediate to those of the individual templates. Although Modeler 5.0 allows for the use of all four templates simultaneously in generating models, models generated in this fashion had less favorable Z-scores than the best of all four individual models. The average Z-score of the models obtained when using all four templates simultaneously was 3.9, while the average Z-score obtained using any of the four templates independently was 5.9. In this respect, the ProsaII Z-score is sufficiently sensitive to identify small differences in model accuracy. When all four templates are used to derive models for the template sequences themselves, an average Z-score of 8.1 and an average r.m.s.d. from the crystal structure C
-positions of 1.6 Å are obtained. When the best model obtained with any of the four templates is selected, these numbers are 9.4 and 0.16 Å, respectively. The fact that using multiple structures simultaneously for homology modeling yields models of lower quality than selecting the best model derived from individual templates is thought to be due to the structural diversity and low sequence identity among the nuclear hormone receptor LBD structures.
|
Figure 1C shows a plot of the best Z-scores obtained using any of all four templates versus the cumulative percentage of all models that obtained such a Z-score. The distributions for sequences with annotations in the SwissProt database (Bairoch and Apweiler, 1996
) and sequences without annotations are shown in the same figure. Models for unannotated C.elegans sequences have less favorable Z-scores than models based on annotated mammalian sequences. This reflects the evolutionary distance between the C.elegans sequences and the mammalian template structures used in this study.
Comparison of the two curves in Figure 1C allows for the determination of a threshold below which sequences can be assumed to be reliably annotated to the nuclear receptor LBD fold. Figure 1C
shows that not all annotated sequences obtain a low Z-score. The generation of poor quality models for this fraction of the database is due to misalignments caused by low sequence identity and the absence of representative crystal structures for these sequences. Nevertheless, if one assumes that 90% of the annotated sequences can be modeled correctly, a Z-score threshold of approximately 4.5 can be applied to the sequences without annotation. This would provide annotations for ~70% of the predicted C.elegans proteins. The Z-score threshold of 4.5 used here compares well with tests that involved modeling the sequences of the four template structures and two unrelated sequences with different 3D folds [taken from PDB entries 1B6U(A) and 1BB9] on all four LBD structures. The resulting models based on unrelated sequences had Z-scores higher than 3 and all models derived from the LBD sequences scored below 6.
When PSI-BLAST is applied to the same database, using the four template sequences individually as starting points, 192 unique sequences are eventually aligned with E-values below 104. Of these 192 sequences, 126 were also found using our model building approach. This confirms that sequence alignment protocols and approaches involving homology modeling contribute different information to the process of gene annotation and identify slightly different sets of sequences. Although the somewhat arbitrary choice of threshold in both methods makes a direct comparison difficult, PSI-BLAST performs comparably to our fold recognition approach in terms of the total number of sequences annotated. Interestingly, whereas with PSI-BLAST the RXR sequence identifies the largest number of potential ligand-binding domains (190 sequences compared with 108 for the RAR), the RAR structure allows for the annotation of the largest number of sequences on the basis of homology models (Figure 1B). These differences indicate that sequence annotation through homology modeling is not dominated by sequence homology but actually introduces new, structure-derived information in the annotation process.
Although there is no absolute Z-score criterion defining a correctly annotated sequence, the fact that this analysis yields a sliding scale is not necessarily detrimental. A sliding scale allows the user to select the highest-ranking genes first to be added to a sequence profile or hidden Markov model. Additional sequences could be added in order of their Z-score ranking. As these sequences are selected on structural criteria rather than sequence similarity, they can add new information to the profile or hidden Markov model that cannot be obtained from sequence similarity analysis alone.
Among the top ranking models for C.elegans predicted proteins, a number of sequences are found without previous annotation for an LBD fold by hidden Markov models in the Pfam database (Bateman et al., 1999). The three highest-ranking sequences are the following Pfam entries (Z-scores, structural template and sequence identities): C56E10.4 (7.15, RAR, 16.8%), C06B8.1 (6.90, ER, 12.2%) and ZK418.1 (6.10, RAR, 17.7%). One model with less than 10% sequence identity with the template structure (Pfam entry T07C5.4, based on the RAR LBD with 9.6% identity) was still assigned a Z-score of 5.2. The addition of such sequences could broaden the scope of LBD sequence profiles and the Pfam hidden Markov models further.
In conclusion, large-scale homology modeling of nuclear receptor LBDs in combination with model assessment tools provides a useful addition to current gene annotation strategies. The resulting 3D structure-based annotations can be used to refine sequence-based approaches further. The application of homology modeling approaches depends heavily, however, on the availability of high-quality, representative protein structures and sufficient sequence identity between genes and templates. When structures of diverse members of a protein family are available it is advisable to use all of them individually as templates for comparative modeling, especially when mutual sequence identities are low. The preference of a template structure for certain genes may hint at a similar structure and function.
![]() |
Notes |
---|
![]() |
Acknowledgments |
---|
![]() |
References |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Bairoch,A. and Apweiler,R. (1996) Nucleic Acids Res., 24, 2125.
Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Finn,R.D. and Sonnhammer,E.L. (1999) Nucleic Acids Res., 27, 260262.
Beato,M., Herrlich,P. and Schutz,G. (1995) Cell, 83, 851857.[ISI][Medline]
Bourguet,W., Ruff,M., Chambon,P., Gronemeyer,H. and Moras,D. (1995) Nature, 375, 377382.[ISI][Medline]
Clarke,N.D. and Berg,J.M. (1998) Science, 282, 20182022.
Kliewer,S.A, Lehmann,J.M and Wilson,T.M. (1999) Science 284, 757760.
Lehmann,E.L. and D'Abrera,H.J.M. (1998) Nonparametrics: Statistical Methods Based on Ranks. Prentice-Hall, Englewood Cliffs, NJ.
Lüthy,R., Bowie,J.U. and Eisenberg,D. (1992) Nature, 356, 8385.[ISI][Medline]
Renaud,J.P., Rochel,N., Ruff,M., Vivat,V., Chambon,P., Gronemeyer,H. and Moras,D. (1995) Nature, 378, 681689.[ISI][Medline]
Sali,A. and Blundell,T.L. (1993) J. Mol. Biol., 234, 779815.[ISI][Medline]
Sánchez,R and Sali,A. (1998) Proc. Natl Acad. Sci. USA, 95, 1359713602.
Shiau,A.K., Barstad,D., Loria,P.M., Cheng,L., Kushner,P.J., Agard,D.A. and Greene,G.L. (1998) Cell, 95, 927937.[ISI][Medline]
Sippl,M.J. (1993) Proteins, 17, 355362.[ISI][Medline]
Sluder,A.E., Mathews,S.W., Hough,D., Yin,V.P. and Maina,C.V. (1999) Genome Res., 9, 103120.
Torda,A.E. (1997) Curr. Opin. Struct. Biol., 2, 200205
Vriend,G. and Sander,C. (1993) J. Appl. Crystallogr., 26, 4760.[ISI]
Williams,S.P. and Sigler,P.B. (1998) Nature, 393, 392396.[ISI][Medline]
Received November 22, 1999; revised April 7, 2000; accepted April 17, 2000.