Hidden Markov Models-based system (HMMSPECTR) for detecting structural homologies on the basis of sequential information

Igor Tsigelny1,2,3,4, Yuriy Sharikov2 and Lynn F. Ten Eyck1,2,3

1 Department of Chemistry and Biochemistry 0654 and 2 San Diego Supercomputer Center 0505, 3 Department of Pharmacology, University of California, San Diego, La Jolla, CA 92093, USA


    Abstract
HMMSPECTR is a tool for finding putative structural homologs for proteins with known primary sequences. HMMSPECTR contains four major components: a data warehouse with the hidden Markov model (HMM) and alignment libraries; a search program which compares the initial protein sequences with the libraries of HMMs; a secondary structure prediction and comparison program; and a dominant protein selection program that prepares the set of 10–15 `best' proteins from the chosen HMMs. The data warehouse contains four libraries of HMMs. The first two libraries were constructed using different HMM preparation options of the HMMER program. The third library contains parts (`partial HMM') of initial alignments. The fourth library contains trained HMMs. We tested our program against all of the protein targets proposed in the CASP4 competition. The data warehouse included libraries of structural alignments and HMMs constructed on the basis of proteins publicly available in the Protein Data Bank before the CASP4 meeting. The newest fully automated versions of HMMSPECTR 1.02 and 1.02ss produced better results than the best result reported at CASP4 either by r.m.s.d. or by length (or both) in 64% (HMMSPECTR 1.02) and 79% (HMMSPECTR 1.02ss) of the cases. The improvement is most notable for the targets with complexity 4 (difficult fold recognition cases).

Keywords: Hidden Markov models/HMM/protein structure prediction/secondary structure/structural alignments/tertiary structure


    Introduction
Hidden Markov models (HMMs) have become a regular tool for many tasks in the field of bioinformatics. Profile HMM methods are increasingly used in the area of protein structure prediction. It is known that more than 80% of new protein structures with relatively small sequence similarity to solved structures nevertheless adopt an already known protein fold (Orengo et al., 1994; Eddy, 1998). It is also known that in many cases pairs of proteins can have very good structural alignments and yet have less than 15% sequence identity (Laurents et al., 1994). This makes possible the development of a program that would use structural information hidden in protein folds for finding putative structures of proteins on the basis of their primary sequences. It has been shown (Baldi et al., 1994) that HMMs can serve to model families of biological sequences. HMMs have been used for analysis of specific motifs in protein families (Grundy et al., 1997). Current methods of HMM search for distant homologs of proteins are based on a set of pairwise alignments of protein sequences to a query sequence. For example, the SAM program (Karplus et al., 1998) uses a BLAST search with the initial sequence to produce the sets of potential homologs, which are then used to construct corresponding HMMs.

This paper describes a program that we have developed, HMMSPECTR, that finds putative structural homologs for proteins with known primary sequences. The foundation of HMMSPECTR is the hypothesis that the structural information in protein sequences can be extracted from structural alignments.


    Methods
HMMSPECTR contains four major components: a data warehouse with the HMM and alignment libraries; a search program which compares the initial protein sequences with the libraries of HMMs; a secondary structure prediction and comparison program; and a dominant protein selection program that prepares the set of 10–15 `best' proteins from the chosen HMMs (Figure 1).



Fig. 1. HMMSPECTR contains four major components: a data warehouse with the HMM and alignment libraries; a search program which compares the initial protein sequences with the libraries of HMMs; a secondary structure prediction and comparison program; and a dominant protein selection program that prepares the set of the `best' proteins from the chosen HMMs.

 
Data warehouse construction

The goal of the data warehouse construction was to cover the majority of known structures of proteins. The data warehouse of fold superfamilies was constructed using the SCOP fold classification (Murzin et al., 1995). We created and trained a set of HMMs using the program HMMER (Eddy, 1998). From each fold of SCOP we selected a typical representative: a title protein. The CE program (Shindyalov and Bourne, 1998) was used to create structural alignments of proteins that have tertiary structures close to the title protein. We considered proteins structurally close if the Z-score reported by CE was >4. Multiple pairwise alignments constructed using CE contained 50–800 proteins. The number of proteins in each alignment depends on the Z-score chosen to limit similarity of these proteins to the title protein. Constructed alignments had to include all members of the selected SCOP superfamily. In many cases we chose relatively loose Z-score cut-offs (<4) to obtain multiple alignments that included some proteins with structures sufficiently close to the title protein but not included in this SCOP superfamily or family (Figure 2). This feature was introduced to build statistically rich HMMs. Use of a narrow set of proteins structurally very close to each other in the initial HMM restricts a major advantage of the HMM approach, i.e. estimation of probabilities of transitions between neighboring amino acids. Our goal thus became finding representatives of a specific fold or set of folds instead of finding representatives of a specific family of proteins (Figure 2). We created the HMM corresponding to each set of structural alignments.
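The member selection described above can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the text: the function name `choose_members`, the input dictionary of CE Z-scores and the rule for loosening the cut-off (lower it just far enough to include every superfamily member) are all hypothetical.

```python
def choose_members(z_scores, superfamily, default_cutoff=4.0):
    """Pick alignment members by CE Z-score against the title protein.

    z_scores: mapping protein name -> CE Z-score versus the title protein.
    superfamily: names that must be present in the final alignment.
    The cut-off starts at default_cutoff and is loosened (assumption) to the
    lowest Z-score found among the required superfamily members.
    """
    missing = set(superfamily) - set(z_scores)
    if missing:
        raise KeyError(f"no Z-score for required proteins: {sorted(missing)}")
    cutoff = min([default_cutoff] + [z_scores[name] for name in superfamily])
    # '>=' is used so that the protein that set the loosened cut-off is kept;
    # the text itself states the default criterion as a strict 'Z > 4'.
    members = sorted(name for name, z in z_scores.items() if z >= cutoff)
    return cutoff, members


# Hypothetical Z-scores, for illustration only.
example = {"title": 8.0, "a": 5.1, "b": 3.6, "c": 3.8, "d": 2.0}
cutoff, members = choose_members(example, {"title", "a", "b"})
```

In this toy input, protein `c` is not a required superfamily member but enters the alignment because the loosened cut-off admits it, which mirrors the `loose' sets of Figure 2.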



Fig. 2. Simplified presentation of covering the full space of possible protein folds by loose (a) and narrow (b) sets of structurally related proteins. With loose Z-score cut-offs (<4), the structural alignments obtained included some proteins with structures sufficiently close to the title protein but not included in the specific SCOP superfamily or family.

 
We created structural alignments having as a core each superfamily for the main classes of folds: all alpha proteins (α), all beta proteins (β), alpha and beta proteins (α/β), alpha and beta proteins (α + β), multi-domain proteins (α and β), coiled coil proteins and `small proteins'. We also created alignments for specific families, including EF hand-like (α), PHGase F-like (β), supersandwich (β), NAD(P)-binding Rossmann-fold domains (α/β), thioredoxin fold (α/β), pyruvate–ferredoxin oxidoreductase (PFOR) domain III (α/β), IL8-like (α + β) and zincin-like (α + β). For the globin-like fold (α) we created alignments for all protein domains in two families, globins and phycocyanins. This level of detail was needed to cover all SCOP proteins of specific subdivisions by alignments. The number of structural alignments created was 1500.

We created three libraries of HMMs. The first two libraries were constructed using different HMM preparation options of the HMMER package and the third library contained parts (`partial HMM') of initial alignments. The first library included variants of HMM preparation with different gap-filter values from 0.1 to 0.9. The second library contained trained HMMs. The cyclic HMM training was done by using the initial HMM to create the next multiple alignment, which in turn was used to prepare the HMM for the next step (Tsigelny et al., 2000). This procedure converged in 3–5 cycles. The search procedure then selected HMMs for which the score for specific target sequences grew during the training. In many cases, training increased the score with which these HMMs made specific predictions. The third library consisted of `partial HMMs', based on the observation of significant discontinuities in both the CE Z-score and the HMM scores for members of a family. `Partial HMMs' were obtained by splitting the family at the discontinuities.
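The splitting of a family at score discontinuities can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how a discontinuity is detected, so the fixed `min_jump` threshold, the function name `split_at_discontinuities` and the example scores are all hypothetical.

```python
def split_at_discontinuities(scores, min_jump):
    """Partition family members into groups separated by sharp score drops.

    scores: mapping protein name -> score (e.g. CE Z-score or HMM score).
    min_jump: hypothetical threshold; a drop of at least this size between
    neighbors in the sorted list starts a new group.
    """
    # Sort by descending score; ties are broken by name for reproducibility.
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    groups = []
    for name in ordered:
        if groups and scores[groups[-1][-1]] - scores[name] < min_jump:
            groups[-1].append(name)   # small drop: same sub-family
        else:
            groups.append([name])     # first item or sharp drop: new group
    return groups


# Hypothetical scores, for illustration only.
family = {"p1": 180, "p2": 170, "p3": 165, "p4": 40, "p5": 30, "p6": -150}
subfamilies = split_at_discontinuities(family, min_jump=50)
```

Each resulting group would then be the input alignment for one `partial HMM'. A fixed threshold is the simplest possible choice; a relative or rank-based criterion would fit the description equally well.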

Search procedure

Each HMM from the data warehouse (including trained and untrained HMMs) is tested for concordance with the probe sequence. If the system is not able to pick one with a reasonably high score even using trained HMMs, it shifts to the search of partial HMMs. Eventually it stops when the highest score is found.
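The staged search can be sketched in Python as below. This is a minimal illustration, not the program's implementation: the `score_fn` callback stands in for scoring a sequence against an HMM, and the `threshold` that defines a "reasonably high" score is an assumed parameter, since no value is given in the text.

```python
def staged_search(query, stages, score_fn, threshold):
    """Search HMM libraries stage by stage, falling back only when needed.

    stages: ordered list of (stage_name, models), e.g. the full HMMs
            (untrained and trained) first and the partial HMMs second.
    score_fn: callable(model, query) -> float; a placeholder for the real
              HMM scoring step.
    threshold: assumed cut-off for a 'reasonably high' score.
    Returns (best_score, stage_name, model), or None if there are no models.
    """
    best = None
    for stage_name, models in stages:
        for model in models:
            score = score_fn(model, query)
            if best is None or score > best[0]:
                best = (score, stage_name, model)
        # Stop as soon as a stage yields a sufficiently high score.
        if best is not None and best[0] >= threshold:
            break
    return best


# Toy usage: models are (name, precomputed score) pairs, purely hypothetical.
stages = [("full", [("m1", 10.0), ("m2", 25.0)]), ("partial", [("p1", 60.0)])]
hit = staged_search("QUERYSEQ", stages, lambda model, _query: model[1], threshold=50.0)
```

With the toy data, no full HMM reaches the threshold, so the partial HMMs are consulted and the best-scoring one is returned.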

To select the best final solution we compare the secondary structure of the best 10 candidates, extracted from the DSSP library (Kabsch and Sander, 1983), with the predicted secondary structure of the target protein. For secondary structure prediction we used a new method based on pattern recognition techniques (in preparation).

Table I illustrates the effectiveness of our HMM training procedures on CASP4 protein targets T0109, T0100 and T0087. The training procedures significantly increase the scores and, even more importantly, the length of predicted protein structures. Training does not improve the scores and lengths in all cases; in a number of cases we do not see any improvement. This usually means that the initial HMM was prepared properly and does not need further correction.


Table I. Scores obtained for CASP4 protein targets when their sequences were used as input to HMMSPECTR 1.02
 

    Results and discussion
We tested our program against all of the protein targets proposed in the CASP4 competition (CASP 4, 2000). The data warehouse included libraries of structural alignments and HMMs constructed on the basis of proteins publicly available in the Protein Data Bank before the CASP4 meeting. An earlier version of the methods described here was entered in the CASP4 experiment, where it produced middling results. On examination of these results we saw that they were based on only the first of the models that we postulated, but in many cases we had provided other models that were clearly (with hindsight) better. The comparison of secondary structure prediction with the secondary structure of the model provided much improvement. It should be noted that although we took all precautions to avoid data contamination, this was not a true blind test; it is, however, the most comprehensive test we can provide at this time.

The following changes were made to the program after the CASP4 meeting:

  1. Fundamental changes were made to the `partial HMM' library construction. This library is used only when the other prediction scores are weak. In the construction of `partial HMMs' we used sorting by `HMM-score' (alignment score between the consensus sequence of an HMM and each of the proteins in the alignment) instead of our previous initial sorting by CE Z-score (every protein versus the superfamily representative `title protein'). The family is partitioned at sharp changes of HMM-scores. The CE Z-score range of 3–7 is much less reliable for finding such changes than the HMM-scores, which span a much broader range, say from -200 to +200.
  2. The number of gap filters (g-filters) used was increased from the set 0.4, 0.5, 0.6 to 0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9. The results derived using all these filters are now stored in the data warehouse.
  3. We improved our scoring function by making it depend on the length of the predicted protein sequence.
  4. Secondary structure prediction was introduced into the program. Currently the final score of the predicted protein structure is defined by both the HMM-score and the secondary structure correspondence score. Reliability of the secondary structure prediction is also incorporated in the score function.

The results of our tests on the CASP4 targets are shown in Tables II and III. The newest fully automated versions of HMMSPECTR 1.02 and 1.02ss produced better results than the best result reported at CASP4 either by r.m.s.d. or by length (or both) in 64% (HMMSPECTR 1.02) and 79% (HMMSPECTR 1.02ss) of the cases.


Table II. Results of protein structure prediction of CASP4 targets
 

Table III. Comparison of HMMSPECTR results with CASP 4 predictions
 
Table III shows the effect of the complexity of targets on the results of HMMSPECTR. The 1–5 scale of complexity of targets (CASP 4, 2000) is used for these calculations. There is some improvement of results for the simple targets with complexity 1 and 2. These simple `homology modeling' targets are readily solved by many sequence-based methods. For the targets with complexity 3 we see improvements in length and r.m.s.d. For the targets with complexity 4 the improvement of both parameters is most pronounced. For the targets with the highest complexity, 5, predictions are also improved, but less than in the best case of complexity 4.

The details of our current protein structure prediction strategy, which uses the HMM score and the secondary structure prediction score, are illustrated with CASP4 target T89 (PDB code 1E4F, cell division protein FtsA from Thermotoga maritima).

Primary HMM search brought the following best results from three libraries:

In the pre-CASP4 period we would simply have used the best prediction, 1QHA:B(917:80–462). In the new version of the program, the g-library with its increased number of filters selected 1HLU:A as a best prediction. Nevertheless, the best prediction by HMM-score would still be 1QHA:B. Only taking into consideration the secondary structure prediction score makes it possible to predict the right structure, 1HLU:A, in fully automated mode:

The final prediction is made using the minimum sum of sorting scores of both HMM and SS predictions.

The final prediction of the tertiary structure for target T89 is protein 1HLU_A.
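The rank-sum selection rule stated above (minimum sum of sorting positions in the HMM and secondary structure lists) can be sketched in Python. The function names, the alphabetical tie-breaking rule and the candidate names and scores in the example are hypothetical and are not taken from the T89 results.

```python
def rank_positions(scores):
    """Return 1-based positions after sorting by descending score."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def select_by_rank_sum(hmm_scores, ss_scores):
    """Pick the candidate with the smallest sum of HMM and SS sorting positions.

    Only candidates present in both score tables are considered. Ties are
    resolved alphabetically (an assumption; the text does not say how).
    """
    common = sorted(set(hmm_scores) & set(ss_scores))
    if not common:
        raise ValueError("no candidate has both an HMM score and an SS score")
    hmm_rank = rank_positions({name: hmm_scores[name] for name in common})
    ss_rank = rank_positions({name: ss_scores[name] for name in common})
    return min(common, key=lambda name: (hmm_rank[name] + ss_rank[name], name))


# Hypothetical candidates and scores, for illustration only.
hmm = {"cand_A": 900.0, "cand_B": 850.0, "cand_C": 400.0}
ss = {"cand_A": 2.0, "cand_B": 9.5, "cand_C": 5.0}
winner = select_by_rank_sum(hmm, ss)
```

In the toy data `cand_A` has the best HMM score but the worst secondary structure agreement, so `cand_B` (positions 2 and 1) wins on the rank sum. This is the same kind of reversal as the one described for 1QHA:B and 1HLU:A.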

H2M = HMMscore × Lr/Lt for positive HMM scores; H2M = HMMscore × Lt/Lr for negative HMM scores, where Lt = length of target protein sequence. When Lr = Lt, H2M = HMMscore. The secondary structure score is calculated by adding 1 for the first identical letter shared by the secondary structure of a target and the predicted secondary structure, adding 0.1 for each subsequent uninterrupted identical letter and subtracting 0.1 for each gap. The sum is multiplied by the coefficient of reliability of prediction in each case, Kpr, which has values from 0.1 to 0.9. The resulting score is also multiplied by a coefficient that takes into account the length of the region of correspondence.
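The two scores defined above can be written out as a short Python sketch. Several points are assumptions rather than statements from the text: Lr is taken as a length supplied by the caller (the text does not define it), the two secondary structure strings are assumed to be pre-aligned and of equal length with `-` marking gaps, a mismatch is assumed to end a run without penalty, and the length coefficient is left as a plain parameter because its form is not given.

```python
def h2m(hmm_score, l_r, l_t):
    """Length-corrected HMM score: scaled by Lr/Lt if positive, Lt/Lr if negative."""
    if l_r <= 0 or l_t <= 0:
        raise ValueError("lengths must be positive")
    if hmm_score >= 0:
        return hmm_score * l_r / l_t
    return hmm_score * l_t / l_r


def ss_score(target_ss, candidate_ss, k_pr, length_coeff=1.0, gap="-"):
    """Secondary structure correspondence score for two aligned SS strings.

    +1 for the first identical letter of a run, +0.1 for each further
    uninterrupted identical letter, -0.1 for each gap position. The sum is
    multiplied by k_pr (reliability, 0.1 to 0.9 in the text) and by
    length_coeff (placeholder for the unspecified length coefficient).
    """
    if len(target_ss) != len(candidate_ss):
        raise ValueError("secondary structure strings must be aligned")
    total = 0.0
    in_run = False
    for a, b in zip(target_ss, candidate_ss):
        if a == gap or b == gap:
            total -= 0.1
            in_run = False
        elif a == b:
            total += 0.1 if in_run else 1.0
            in_run = True
        else:
            in_run = False  # assumption: a mismatch only interrupts the run
    return total * k_pr * length_coeff
```

For instance, `h2m(100, 50, 100)` halves a positive score when Lr is half of Lt, while `h2m(-100, 50, 100)` doubles the magnitude of a negative one, so in both cases a shorter Lr relative to Lt lowers the corrected score.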

HMMSPECTR is successful because it explicitly allows for two of the basic problems of predicting new structures from libraries of known structures. The basic assumptions in this process are that (1) the structures are properly classified and (2) they are properly aligned. When this is the case, the normally prepared HMMs in the data warehouse correctly classify target sequences to the appropriate folds. However, the classification of structures may not be sufficiently detailed to reflect the true sequence to structure code for some folds. In this case, the `partial HMMs' subdivide the classifications based on discontinuities of similarity scores in the originally classified data. This weakens the statistical power of the HMMs, so this method is used only when the other methods have returned ambiguous results. When the structures are correctly classified but there are problems in the structural alignments, the trained HMMs will give better results. The trained HMMs are used to allow some revision of the initial structural alignment, but they are of course statistically biased. The diagnostic feature of the trained HMMs is that if the target sequence follows a similar progression of scores through the HMMs generated during the training process then it may well have similar structural behavior to the title protein despite alignment ambiguities obscuring the signal from the HMMs.

Further exploration of the value of our program was done on complex proteins for which structures are not yet available. Figure 3 shows results obtained using HMMSPECTR for structure prediction of cystic fibrosis transmembrane regulator (CFTR). The program predicted proteins consistent with the known structural domains of CFTR. Two of the important functional domains of CFTR are NBD-1 and NBD-2 (the first and second nucleotide-binding domains). HMMSPECTR predicted a correspondence between the NBD-1 region of CFTR and the tertiary structure of 2AY5 (aromatic amino acid aminotransferase). Following this unpublished prediction, the structure of part of an ABC transporter protein (the ATP-binding subunit of histidine permease) was solved in the laboratory of Sung-ho Kim at UC Berkeley (Hung et al., 1998). This molecule has significant homology to NBD-1 of CFTR. The structure of the ABC transporter was not present in the PDB and was not used in our preparation of initial HMMs. Nevertheless, when we received it directly from Dr Kim we used it to construct a homology model of NBD-1 of CFTR and then superimposed this model on the tertiary structure of 2AY5.



Fig. 3. Results obtained using HMMSPECTR for structure prediction of different domains of cystic fibrosis transmembrane regulator (CFTR). The program predicted proteins consistent with the known structural domains of CFTR: 2AY5 aromatic amino acid aminotransferase, 1FIE recombinant human coagulation factor XIII, 1ILE isoleucyl-tRNA synthetase, 1QGR importin β bound to the Ibb domain of importin α, 16VP conserved core of the herpes simplex virus transcriptional regulatory protein Vp16, 1AMU phenylalanine activating domain of gramicidin synthetase 1, 1BZY human HGPRTase, 1TUB tubulin α–β dimer.

 
Figure 4 shows the superimposition of the two proteins 2AY5 and ABC transporter. All four helices of both molecules superimpose well. Moreover, three β-strands in the region between the helices are also closely positioned. There are inserts in each molecule (lines) that are not superimposable, but these inserts do not compromise the striking overall correspondence of the two structures.



Fig. 4. Superimposition of two proteins, 2AY5 and ABC transporter. Protein 2AY5 represents a tertiary structure predicted by HMMSPECTR on the basis of the CFTR nucleotide-binding domain sequence. ABC transporter is known to correspond structurally to CFTR. Dark ribbon, CFTR; light gray ribbon, 2AY5.

 


PRIMARY HMM-SEARCH:
 

SORTED BY SIMILARITY OF SECONDARY STRUCTURES
 

SORTED BY HMM-SPECTR:
 

    Notes
 
4 To whom correspondence should be addressed. E-mail: itsigeln@ucsd.edu


    References
Baldi,P., Chauvin,Y., Hunkapiller,T. and McClure,M.A. (1994) Proc. Natl Acad. Sci. USA, 91, 1059–1063.

CASP 4 (2000) Fourth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction, Asilomar, CA.

Eddy,S. (1998) Bioinformatics, 14, 755–763.

Grundy,W.N., Bailey,T.L., Elkan,C.P. and Baker,M.E. (1997) Biochem. Biophys. Res. Commun., 231, 760–766.

Hung,L.W., Wang,I.X., Nikaido,K., Liu,P.Q., Ames,G.F. and Kim,S.H. (1998) Nature, 396, 703–707.

Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.

Karplus,K., Barrett,C. and Hughey,R. (1998) Bioinformatics, 14, 846–856.

Laurents,D.V., Subbiah,S. and Levitt,M. (1994) Protein Sci., 3, 1938–1944.

Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.

Orengo,C., Jones,D.T. and Thornton,J.M. (1994) Nature, 372, 631–634.

Shindyalov,I.N. and Bourne,P.E. (1998) Protein Eng., 11, 739–747.

Tsigelny,I., Shindyalov,I.N., Bourne,P.E., Südhof,T.C. and Taylor,P. (2000) Protein Sci., 9, 180–185.

Received July 27, 2001; revised January 2, 2002; accepted February 8, 2002.




