1 Department of Chemistry and Biochemistry 0654, 2 San Diego Supercomputer Center 0505 and 3 Department of Pharmacology, University of California, San Diego, La Jolla, CA 92093, USA
Abstract
Keywords: Hidden Markov models/HMM/protein structure prediction/secondary structure/structural alignments/tertiary structure
Introduction
This paper describes a program that we have developed, HMMSPECTR, that finds putative structural homologs for proteins with known primary sequences. The foundation of HMMSPECTR is the hypothesis that the structural information in protein sequences can be extracted from structural alignments.
Methods
The goal of the data warehouse construction was to cover the majority of known protein structures. The data warehouse of fold superfamilies was constructed using the SCOP fold classification (Murzin et al., 1995). We created and trained a set of HMMs using the program HMMER (Eddy, 1998). From each SCOP fold we selected a typical representative, the title protein. The CE program (Shindyalov and Bourne, 1998) was used to create structural alignments of proteins whose tertiary structures are close to that of the title protein. We considered proteins structurally close if the Z-score reported by CE was >4. The multiple pairwise alignments constructed using CE contained 50–800 proteins. The number of proteins in each alignment depends on the Z-score chosen to limit the similarity of these proteins to the title protein. Each alignment had to include all members of the selected SCOP superfamily. In many cases we chose relatively loose Z-score cut-offs (<4) to obtain multiple alignments that included some proteins with structures sufficiently close to the title protein but not included in this SCOP superfamily or family (Figure 2). This feature was introduced to build statistically rich HMMs. Using a narrow set of proteins that are structurally very close to each other in the initial HMM restricts a major advantage of the HMM approach, i.e. estimation of the probabilities of transitions between neighboring amino acids. Our goal thus became finding representatives of a specific fold or set of folds rather than representatives of a specific family of proteins (Figure 2). We created an HMM for each set of structural alignments.
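The selection of structural neighbors by Z-score can be sketched as a simple filter. This is a minimal illustration, assuming the CE results for one title protein are already available as (chain identifier, Z-score) pairs; the function name, the data layout and the example values are hypothetical and are not part of CE or HMMSPECTR.

```python
from typing import Iterable, List, Tuple


def select_structural_neighbors(
    ce_hits: Iterable[Tuple[str, float]],
    z_cutoff: float = 4.0,
) -> List[str]:
    """Return the chains whose CE Z-score against the title protein exceeds z_cutoff."""
    return [chain for chain, z_score in ce_hits if z_score > z_cutoff]


# Hypothetical CE output for one title protein (names and values are invented).
hits = [
    ("neighbor_1", 6.2),
    ("neighbor_2", 4.5),
    ("neighbor_3", 3.7),
    ("neighbor_4", 2.9),
]

strict = select_structural_neighbors(hits)                # default cut-off of 4
loose = select_structural_neighbors(hits, z_cutoff=3.5)   # looser cut-off admits more chains
```

Lowering `z_cutoff` widens the alignment, which is the trade-off described above: more sequence diversity for the HMM at the cost of including chains outside the SCOP superfamily.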
We created three libraries of HMMs. The first two libraries were constructed using different HMM preparation options of the HMMER package, and the third library contained parts ('partial HMMs') of the initial alignments. The first library included variants of HMM preparation with gap-filter values from 0.1 to 0.9. The second library contained trained HMMs. Cyclic HMM training was done by using the initial HMM to create the next multiple alignment, which in turn was used to prepare the HMM for the next step (Tsigelny et al., 2000). This procedure converged in 3–5 cycles. The search procedure then selected HMMs for which the score for specific target sequences grew during the training. In many cases, training increased the score with which these HMMs made specific predictions. The third library consisted of 'partial HMMs', based on the observation of significant discontinuities in both the CE Z-score and the HMM scores for members of a family. 'Partial HMMs' were obtained by splitting the family at the discontinuities.
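Two steps in this paragraph, the cyclic training and the split of a family into 'partial HMM' groups, can be sketched as follows. This is only an illustration under stated assumptions: `build_hmm` and `realign` are hypothetical callables standing in for whatever tools build an HMM from an alignment and realign sequences to it, and the criterion for a 'discontinuity' (a drop larger than `min_gap` between consecutive sorted scores) is an assumption, since the text does not give the criterion actually used.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

A = TypeVar("A")  # placeholder type for a multiple alignment
M = TypeVar("M")  # placeholder type for an HMM


def cyclic_training(
    initial_alignment: A,
    build_hmm: Callable[[A], M],
    realign: Callable[[M], A],
    n_cycles: int,
) -> List[M]:
    """Return the HMM from every cycle, with the untrained HMM first."""
    models = [build_hmm(initial_alignment)]
    for _ in range(n_cycles):
        # The current HMM is used to create the next multiple alignment ...
        alignment = realign(models[-1])
        # ... which in turn is used to prepare the HMM for the next step.
        models.append(build_hmm(alignment))
    return models


def split_at_discontinuities(
    scored_members: Sequence[Tuple[str, float]],
    min_gap: float,
) -> List[List[str]]:
    """Sort members by descending score and start a new group wherever the
    score drops by more than min_gap between neighbors (assumed criterion)."""
    ordered = sorted(scored_members, key=lambda member: member[1], reverse=True)
    groups: List[List[str]] = []
    previous_score: Optional[float] = None
    for name, score in ordered:
        if previous_score is None or previous_score - score > min_gap:
            groups.append([])
        groups[-1].append(name)
        previous_score = score
    return groups


# Hypothetical family members with invented scores.
family = [("a", 9.1), ("b", 8.8), ("c", 8.5), ("d", 5.2), ("e", 5.0), ("f", 4.1)]
groups = split_at_discontinuities(family, min_gap=1.0)
```

Keeping every intermediate HMM in `cyclic_training` makes it possible to check afterwards whether the score of a target sequence grew from cycle to cycle, which is the selection criterion described above.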
Search procedure
Each HMM from the data warehouse (including trained and untrained HMMs) is tested for concordance with the probe sequence. If the system is not able to pick one with a reasonably high score even using trained HMMs, it shifts to the search of partial HMMs. Eventually it stops when the highest score is found.
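The fallback logic of this search can be sketched as a two-stage cascade. This is a minimal sketch, assuming the scores of the probe sequence against each library have already been computed and stored in dictionaries; the threshold `min_score` is hypothetical, because the text says only that the score must be 'reasonably high'.

```python
from typing import Dict, Optional, Tuple


def best_hit(scores: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return the (HMM name, score) pair with the highest score, or None if empty."""
    if not scores:
        return None
    name = max(scores, key=lambda key: scores[key])
    return name, scores[name]


def cascade_search(
    full_scores: Dict[str, float],
    partial_scores: Dict[str, float],
    min_score: float,
) -> Optional[Tuple[str, str, float]]:
    """Use the untrained and trained HMMs first; fall back to the partial HMMs
    only when no full HMM reaches min_score (an assumed threshold)."""
    hit = best_hit(full_scores)
    if hit is not None and hit[1] >= min_score:
        return ("full", hit[0], hit[1])
    hit = best_hit(partial_scores)
    if hit is None:
        return None
    return ("partial", hit[0], hit[1])


# Hypothetical scores of one probe sequence (names and values are invented).
full = {"hmm_1": 12.0, "hmm_2": 30.5}
partial = {"partial_1": 18.0}

confident = cascade_search(full, partial, min_score=20.0)
fallback = cascade_search(full, partial, min_score=40.0)
```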
To select the best final solution we compare the secondary structures of the 10 best candidates, extracted from the DSSP library (Kabsch and Sander, 1983), with the predicted secondary structure of the target protein. For secondary structure prediction we used a new method based on pattern recognition techniques (in preparation).
Table I illustrates the effectiveness of our HMM training procedures on CASP 4 protein targets T0109, T0100 and T0087. The training procedures significantly increase the scores and, even more importantly, the length of the predicted protein structures. Training does not, however, improve the scores and lengths in all cases; in a number of cases we see no improvement. This usually means that the initial HMM was prepared properly and does not need further correction.
Results and discussion
The following changes were made to the program after the CASP-4 meeting:
The results of our tests on the CASP4 targets are shown in Tables II and III. The newest fully automated versions of HMMSPECTR 1.02 and 1.02ss produced better results than the best result reported at CASP4 either by r.m.s.d. or by length (or both) in 64% (HMMSPECTR 1.02) and 79% (HMMSPECTR 1.02ss) of the cases.
The details of our current protein structure prediction strategy, which uses the HMM score and the Secondary Structure Prediction score, are illustrated using CASP-4 target T89 (PDB code 1E4F, cell division protein FtsA from Thermotoga maritima) as an example.
Primary HMM search brought the following best results from three libraries:
In the pre-CASP-4 period we would simply have used the best prediction, 1QHA:B(917:80462). In the new version of the program, the g-library with an increased number of filters selected 1HLU:A as the best prediction. Nevertheless, the best prediction by HMM score would still be 1QHA:B. Only by taking the Secondary Structure Prediction score into consideration is it possible to predict the right structure, 1HLU:A, in fully automated mode:
The final prediction is made using the minimum sum of sorting scores of both HMM and SS predictions.
The final prediction of the tertiary structure for target T89 is protein 1HLU_A.
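The 'minimum sum of sorting scores' rule can be sketched as a rank-sum selection. This is one reading of the rule, assuming that the sorting score of a candidate is its 1-based position in the list sorted by descending score; the candidate names and values are invented, and ties in the rank sum are broken alphabetically, which is an arbitrary choice made for this example.

```python
from typing import Dict


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each candidate to its 1-based position when sorted by descending score."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def pick_by_rank_sum(hmm_scores: Dict[str, float], ss_scores: Dict[str, float]) -> str:
    """Return the candidate with the smallest sum of HMM rank and SS rank."""
    hmm_rank = rank_positions(hmm_scores)
    ss_rank = rank_positions(ss_scores)
    shared = sorted(hmm_rank.keys() & ss_rank.keys())  # sorted for a stable tie-break
    if not shared:
        raise ValueError("no candidate has both an HMM score and an SS score")
    return min(shared, key=lambda name: hmm_rank[name] + ss_rank[name])


# Hypothetical candidates: cand_a is best by HMM score alone,
# but cand_b has the smallest combined rank.
hmm_scores = {"cand_a": 50.0, "cand_b": 45.0, "cand_c": 20.0}
ss_scores = {"cand_a": 1.0, "cand_b": 3.0, "cand_c": 2.0}

winner = pick_by_rank_sum(hmm_scores, ss_scores)
```

Working with ranks rather than raw scores avoids having to put the HMM score and the secondary structure score on a common scale.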
H2M = HMMscore × Lr/Lt for positive HMM scores; H2M = HMMscore × Lt/Lr for negative HMM scores, where Lt = length of the target protein sequence. When Lr = Lt, H2M = HMMscore. The secondary structure score is calculated by adding 1 for the first identical letter of the secondary structure of a target and the predicted secondary structure, adding 0.1 for each subsequent uninterrupted identical letter and subtracting 0.1 for each gap. The sum is multiplied by the coefficient of reliability of the prediction in each case, Kpr, which has values from 0.1 to 0.9. The resulting score is also multiplied by a coefficient that takes into account the length of the region of correspondence.
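Both scores can be sketched in a few lines. The H2M function follows the formula as written, with `l_r` and `l_t` standing for Lr and Lt. The secondary structure function is one possible reading of the description, under these assumptions: the two secondary structure strings are already aligned and of equal length, a gap is marked with '-', a mismatch interrupts a run of identical letters without a penalty, and the final length coefficient is omitted because its form is not given in the text.

```python
def h2m(hmm_score: float, l_r: int, l_t: int) -> float:
    """Length-corrected HMM score: scaled by Lr/Lt if positive, by Lt/Lr if negative."""
    if l_r <= 0 or l_t <= 0:
        raise ValueError("lengths must be positive")
    if hmm_score >= 0:
        return hmm_score * l_r / l_t
    return hmm_score * l_t / l_r


def ss_score(target_ss: str, predicted_ss: str, k_pr: float, gap_char: str = "-") -> float:
    """Secondary structure score for two pre-aligned strings (assumed interpretation).

    +1.0 for the first identical letter of each run, +0.1 for each further
    uninterrupted identical letter, -0.1 for each gap; the sum is scaled by k_pr.
    """
    if len(target_ss) != len(predicted_ss):
        raise ValueError("the two strings must be aligned to the same length")
    total = 0.0
    in_run = False
    for target_letter, predicted_letter in zip(target_ss, predicted_ss):
        if gap_char in (target_letter, predicted_letter):
            total -= 0.1
            in_run = False
        elif target_letter == predicted_letter:
            total += 0.1 if in_run else 1.0
            in_run = True
        else:
            in_run = False  # assumption: a mismatch ends the run without a penalty
    return total * k_pr


# Invented values for illustration only.
positive = h2m(40.0, l_r=80, l_t=100)    # scaled by Lr/Lt
negative = h2m(-40.0, l_r=80, l_t=100)   # scaled by Lt/Lr
example_ss = ss_score("HHHEE", "HHH-E", k_pr=0.5)
```

With Lr shorter than Lt, both branches of `h2m` move the score in the unfavorable direction: a positive score shrinks and a negative score becomes more negative.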
HMMSPECTR is successful because it explicitly allows for two of the basic problems of predicting new structures from libraries of known structures. The basic assumptions in this process are that (1) the structures are properly classified and (2) they are properly aligned. When this is the case, the normally prepared HMMs in the data warehouse correctly assign target sequences to the appropriate folds. However, the classification of structures may not be sufficiently detailed to reflect the true sequence-to-structure code for some folds. In this case, the 'partial HMMs' subdivide the classifications based on discontinuities of similarity scores in the originally classified data. This weakens the statistical power of the HMMs, so this method is used only when the other methods have returned ambiguous results. When the structures are correctly classified but there are problems in the structural alignments, the trained HMMs will give better results. The trained HMMs are used to allow some revision of the initial structural alignment, but they are of course statistically biased. The diagnostic feature of the trained HMMs is that if the target sequence follows a similar progression of scores through the HMMs generated during the training process, then it may well have similar structural behavior to the title protein, even when alignment ambiguities obscure the signal from the HMMs.
Further exploration of the value of our program was done on complex proteins for which structures are not yet available. Figure 3 shows results obtained using HMMSPECTR for structure prediction of the cystic fibrosis transmembrane regulator (CFTR). The program predicted proteins consistent with the known structural domains of CFTR. Two of the important functional domains of CFTR are NBD-1 and NBD-2 (the first and second nucleotide-binding domains). HMMSPECTR predicted a correspondence between the NBD-1 region of CFTR and the tertiary structure of 2AY5 (aromatic amino acid aminotransferase). Following this unpublished prediction, the structure of part of an ABC transporter protein (the ATP-binding subunit of histidine permease) was solved in the laboratory of Sung-ho Kim at UC Berkeley (Hung et al., 1998). This molecule has significant homology to NBD-1 of CFTR. The structure of the ABC transporter was not present in the PDB and was not used in our preparation of the initial HMMs. Nevertheless, when we received it directly from Dr Kim, we used it to construct a homology model of NBD-1 of CFTR and then superimposed this model on the tertiary structure of 2AY5.
References
CASP 4 (2000) Fourth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction, Asilomar, CA.
Eddy,S. (1998) Bioinformatics, 14, 755–763.
Grundy,W.N., Bailey,T.L., Elkan,C.P. and Baker,M.E. (1997) Biochem. Biophys. Res. Commun., 231, 760–766.
Hung,L.W., Wang,I.X., Nikaido,K., Liu,P.Q., Ames,G.F. and Kim,S.H. (1998) Nature, 396, 703–707.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
Karplus,K., Barrett,C. and Hughey,R. (1998) Bioinformatics, 14, 846–856.
Laurents,D.V., Subbiah,S. and Levitt,M. (1994) Protein Sci., 3, 1938–1944.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Orengo,C., Jones,D.T. and Thornton,J.M. (1994) Nature, 372, 631–634.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein Eng., 11, 739–747.
Tsigelny,I., Shindyalov,I.N., Bourne,P.E., Südhof,T.C. and Taylor,P. (2000) Protein Sci., 9, 180–185.
Received July 27, 2001; revised January 2, 2002; accepted February 8, 2002.