1Serono Reproductive Biology Institute, One Technology Place, Rockland, MA 02370, 2Department of Biology, MIT, Mass Avenue, Cambridge, MA 02139 and 3BMERC, Boston University, 36 Cummington Street, Boston, MA 02215, USA
4 To whom correspondence should be addressed. e-mail: Jadwiga.Bienkowska{at}serono.com
Abstract |
---|
Keywords: homologs/search method/small proteins/ubiquitin-like proteins
Introduction |
---|
For small proteins, about 100 amino acids long, the statistical significance of sequence similarity is limited by the large size of the protein database and the sequences' short length. The expectation (e-value) for finding a random sequence match in a database takes into account the size of the database, and for short sequences (small proteins) the probability of a random match is quite high. A BLAST search (with default parameters) for ubiquitin homologs in the microbial database at NCBI returns only insignificant hits with e-values >1.4, and with a growing database this number will increase. Homolog detection methods based on sequence similarity use search algorithms such as BLAST, FASTA or Smith–Waterman that maximize the sequence similarity scores. The final output of a search is a set of optimal sequence alignments assessed by their statistical significance (Karlin and Altschul, 1993). A consequence of the score maximization is that aligning sequences of proteins with dissimilar structures often leads to sequence identities as high as 25–30% (Abagyan and Batalov, 1997). On the other hand, when similar structures of unrelated proteins are aligned, the sequence identity calculated for the structural alignment is usually below 10%. Since structure is evolutionarily more conserved than sequence, conservation of structurally equivalent residues obtained from a structural alignment is more meaningful than conservation of residues obtained from a sequence similarity alignment. For distant homologs the evolutionary relationship often cannot be confidently identified by sequence similarity searches, but the structural alignment can show sequence conservation that supports the hypothesis of a common evolutionary origin. Additional inspection of the alignment may indicate that functionally important residues are conserved, thus giving credence to the common origin hypothesis.
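The dependence of the e-value on database size can be illustrated with the Karlin–Altschul relation E = K·m·n·exp(−λS), where m is the query length, n the database length and S the raw alignment score. The sketch below is only an illustration: the function name, the example score and the database sizes are hypothetical, and the values of λ and K are assumed typical values for a gapped BLOSUM62 search rather than parameters taken from this study.

```python
import math


def expected_hits(score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul expectation E = K * m * n * exp(-lambda * S).

    lam and k are assumed, illustrative values; in practice they depend on
    the scoring matrix and gap penalties used by the search program.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Same raw score and same 76-residue query against two hypothetical
# database sizes (in residues): the expectation grows with the database.
small_db = expected_hits(score=60, query_len=76, db_len=1e8)
large_db = expected_hits(score=60, query_len=76, db_len=1e9)
```

Because a short query also caps the highest score an alignment can reach, a small protein is penalized twice: its best possible score is low and the expectation for that score rises as the database grows.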
To address the challenge of identifying homologs of small proteins, we have designed a process of sequence information analysis that combines structure and sequence comparison methods. Figure 1 outlines the steps and the information flow in this process. The process maximizes the detection of proteins that are likely to have a function similar to the target protein family and minimizes the number of false positives in the final result. It starts by identifying all candidate proteins that are likely to be structurally similar to the target family and proceeds by using sequence comparison methods to eliminate false positives and to identify similar protein sequences. The last step classifies the identified candidate proteins into functional superfamily members and those that are just structural analogs. We have implemented this general approach using specific methods for structure and sequence comparison: Bayesian fold recognition (Bienkowska et al., 2000, 2001) and the PSI-BLAST algorithm (Altschul et al., 1997). The last classification step uses the CLUSTAL sequence alignment (Higgins et al., 1994). The specific methods used in this investigation are described in Materials and methods. However, other combinations of structure recognition, sequence similarity search and alignment methods could be used instead, for example different fold recognition methods such as GenTHREADER (Jones, 1999), FFAS (Rychlewski et al., 2000) and SAM99 (Karplus et al., 1999). Other fold recognition methods can be found on the site of the most recent CASP5 protein structure prediction contest (CASP5).
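The information flow of the first two steps can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study: `sort_candidates`, `fold_posterior` and `known_family` are hypothetical names, with the two callables standing in for whichever fold recognition method and sequence similarity search are plugged in.

```python
from typing import Callable, Dict, Iterable, List, Optional


def sort_candidates(
    sequences: Dict[str, str],
    fold_posterior: Callable[[str], float],
    known_family: Callable[[str], Optional[str]],
    target_families: Iterable[str],
    threshold: float = 0.5,
) -> Dict[str, List[str]]:
    """Sort sequences into the outcome classes of the screening steps."""
    targets = set(target_families)
    result = {"false_positive": [], "known_homolog": [], "to_classify": []}
    for name, seq in sequences.items():
        # Step 1: keep only sequences predicted to adopt the target fold.
        if fold_posterior(seq) <= threshold:
            continue
        # Step 2: sequence comparison against proteins of known structure.
        family = known_family(seq)
        if family is None:
            # No detectable sequence homolog: handed to the clustering step.
            result["to_classify"].append(name)
        elif family in targets:
            result["known_homolog"].append(name)
        else:
            # Similar by sequence to a protein with a different structure.
            result["false_positive"].append(name)
    return result


# Toy run with made-up scores and labels standing in for real tools.
toy_posteriors = {"SEQ_A": 0.9, "SEQ_B": 0.8, "SEQ_C": 0.7, "SEQ_D": 0.1}
toy_families = {"SEQ_A": "ubiquitin-like", "SEQ_B": "other fold"}
outcome = sort_candidates(
    sequences={name: name for name in toy_posteriors},  # names double as toy sequences
    fold_posterior=toy_posteriors.get,
    known_family=toy_families.get,
    target_families=["ubiquitin-like"],
)
```

Sequences that end up in `to_classify` are the ones for which sequence comparison finds no homolog. These are passed to the final clustering step, which separates superfamily members from structural analogs.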
The ubiquitin and ubiquitin-related proteins are essential for a number of cellular mechanisms (Hochstrasser, 2000). In its classical role, ubiquitin serves as a tag for proteins that are destined for degradation by the proteasome. The covalent attachment of ubiquitin directs the modified protein to the proteasome. This role appears exclusive to eukaryotes and is not known in Archaea or Bacteria. A related SUMO protein is used in yeast for tagging proteins and directing them to discrete sites within the nucleus. Like ubiquitin, SUMO has many diverse targets, among them the transcription factors NF-κB/rel and p53. Another ubiquitin-related protein in yeast, RUB1 (related to ubiquitin 1), targets only one family, the cullins. Other proteins (like ribosomal protein L30) have a ubiquitin-like domain built into the protein structure itself, perhaps used to direct them to the ribosomal assembly. These observations suggest that ubiquitin-related tagging systems are a widely used mechanism for directing proteins to specific locations within the eukaryotic cell.
In Archaea and Bacteria, the recently solved structures of the prokaryotic sulfur carrier proteins ThiS and MoaD are structurally similar to ubiquitin, although their sequence identity with ubiquitin is only 14% according to the structural alignment (Rudolph et al., 2000; Wang et al., 2000). The sulfur transfer mechanism used by ThiS and MoaD is also similar to that occurring during ubiquitination. Another protein, HesB from the cyanobacterium Anabaena, has also been proposed as a ubiquitin analog since proteins encoded in the same operon are related to ubiquitin-activating enzyme and ferredoxin. However, the HesB structure has not been determined and the only sequence-based support for this conjecture is the observation that in some species HesB terminates with a Gly–Gly peptide, as ubiquitin does (Hochstrasser, 2000). The question is: are the ThiS and MoaD proteins homologs or analogs of ubiquitin? Are there other ubiquitin-related proteins in Archaea and Bacteria that also play important roles in the organization of the prokaryotic cell? The goal of this investigation was to find as many potentially ubiquitin-related proteins as possible that cannot be easily identified by sequence similarity alone. Those proteins are candidates for investigating whether they have structures similar to ubiquitin and/or play similar cellular roles.
Given ubiquitin's small size, it is difficult to detect distant homologs using sequence similarity searches alone. To search for new ubiquitin-related proteins we have applied the method outlined in Figure 1, which combines fold recognition with sequence similarity. Our method concentrates on the detection of entire structural domains, not on the detection of a domain embedded in the context of a multi-domain protein. Thus, proteins that contain a ubiquitin-like domain among other components, and poly-ubiquitin chains, were not the subject of this study. Support for any implied homology was then investigated by looking for specific functional residue conservation implied by structural alignments.
Materials and methods |
---|
Structural similarity
In this study, we have used a Bayesian fold recognition method (Bienkowska et al., 2000, 2001) as the first step of the process shown in Figure 1. In this method, protein domain structures are modeled by discrete state-space models (DSMs) as described previously (Bienkowska et al., 2000, 2001). DSMs are represented mathematically as hidden Markov models (Stultz et al., 1993). A distinct DSM is automatically built for a homologous protein family when at least one member of the family has a known three-dimensional structure. Such a model characterizes a structural fold common to a protein family, including minor structural variation. The model also encodes a sequence pattern common to the whole family. The SCOP (Murzin et al., 1995) functional domain classification of proteins is used for the identification of autonomous functional domains/families. Models of all functional families defined in SCOP constitute the DSM library of all known folds. We have used the 1.61 release of SCOP to construct the DSM library for all domains with <95% sequence identity (Chandonia et al., 2002). For the 1.61 release of SCOP the DSM library consists of 21 016 models. Those models represent 1517 families and 589 folds from SCOP. Our library of DSMs excludes PDB entries classified by SCOP as membrane, coiled-coil, peptides, low resolution or designed structures.
The search algorithm assesses the match between the query sequence and the model of the family by calculating the total probability of observing the sequence given a model. The total probability is not a maximum of some sequence-to-model similarity score but is simply the sum of the probabilities of all sequence-to-model alignments. Previously, we have shown that the total alignment probability is a better measure of fold recognition than the optimal alignment probability (Bienkowska et al., 2000). In the case of DSMs, the optimal alignment is equivalent to a globally optimal alignment of a sequence to a model. Since a subset of the DSM library encodes sequence profiles (Bienkowska et al., 2001), the search against such models with the global optimal alignment probability as a similarity measure corresponds to the search of sequence profiles with a global alignment score.
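The difference between the total and the optimal alignment probability can be seen on a toy hidden Markov model. The sketch below uses the generic forward recursion (sum over all state paths) and the Viterbi recursion (best single path). It is not the filtering algorithm or the DSMs used in this work, and the two-state model and all of its numbers are made up for illustration.

```python
def total_and_optimal(init, trans, emit, observations):
    """Return (total, optimal) probabilities of the observations under an HMM.

    total   sums over every state path (forward recursion);
    optimal keeps only the single most probable path (Viterbi recursion).
    """
    n = len(init)
    fwd = [init[s] * emit[s][observations[0]] for s in range(n)]
    vit = list(fwd)
    for symbol in observations[1:]:
        fwd = [
            sum(fwd[r] * trans[r][s] for r in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
        vit = [
            max(vit[r] * trans[r][s] for r in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(fwd), max(vit)


# Toy two-state model over a two-letter alphabet (all numbers are made up).
init = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.9, 0.1],
        [0.2, 0.8]]
total, optimal = total_and_optimal(init, trans, emit, [0, 1, 1, 0])
```

The total is never smaller than the optimal value. When many near-optimal alignments exist, it can be considerably larger, and this is the information a score based on the single best alignment discards.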
The statistical significance of the match between the query protein sequence and a model is measured by the posterior probability of the model given the sequence relative to the large library of alternative models (Lathrop et al., 1998; Bienkowska et al., 2000, 2001). According to Bayes' rule, the posterior model probability is given by the total probability of the sequence given the fold and the prior probability of the fold:

$$P(\mathrm{fold}_i \mid \mathrm{seq}) = \frac{P(\mathrm{seq} \mid \mathrm{fold}_i)\,P(\mathrm{fold}_i)}{\sum_{j=1}^{N_f} P(\mathrm{seq} \mid \mathrm{fold}_j)\,P(\mathrm{fold}_j)} \qquad (1)$$

where $N_f$ is the number of folds represented in the library. We select only the model with the highest likelihood score from each fold and assign the sequence probability given a fold model by:

$$P(\mathrm{seq} \mid \mathrm{fold}) = \max_{M_i \in \mathrm{fold}} P(\mathrm{seq} \mid M_i) \qquad (2)$$

where $M_i$ are the models from the library. We assign fold priors uniformly over the folds. The total probability $P(\mathrm{seq} \mid M_i)$ is calculated using the filtering algorithm developed by Jim White (White, 1988). We use a binary decision approach and recognize the sequence as compatible with a structural fold if the posterior probability of the fold (Equation 1) is >0.5.
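The decision rule described above (best-scoring model per fold, uniform fold priors, posterior >0.5) can be sketched as follows. The function names, the fold labels and the likelihood values are hypothetical and serve only to show the arithmetic.

```python
def fold_posteriors(model_likelihoods):
    """Posterior probability of each fold under uniform fold priors.

    model_likelihoods maps a fold name to the list of P(seq | M_i) values of
    its models; each fold is represented by its best-scoring model.
    """
    best = {fold: max(values) for fold, values in model_likelihoods.items()}
    norm = sum(best.values())  # uniform priors cancel in the ratio
    return {fold: p / norm for fold, p in best.items()}


def compatible_folds(model_likelihoods, threshold=0.5):
    """Folds whose posterior probability exceeds the decision threshold."""
    posteriors = fold_posteriors(model_likelihoods)
    return [fold for fold, p in posteriors.items() if p > threshold]


# Made-up likelihoods for three hypothetical folds.
likelihoods = {
    "fold_A": [1e-40, 3e-42],
    "fold_B": [2e-41],
    "fold_C": [5e-42],
}
posteriors = fold_posteriors(likelihoods)
accepted = compatible_folds(likelihoods)
```

Since the posteriors sum to one, at most one fold can pass the 0.5 threshold for a given sequence. With real sequence likelihoods, which are far smaller than the toy values here, the same computation would normally be carried out on log-probabilities to avoid numerical underflow.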
Sequence similarity
As a sequence comparison method we use PSI-BLAST (Altschul et al., 1997). Each candidate homolog sequence is used as a query in a search of the database of SCOP sequences as provided by the ASTRAL database (Chandonia et al., 2002). The search is run in two steps. First, the PSI-BLAST search is run for five iterations against the GenBank non-redundant protein database and the resulting position-specific scoring matrix (PSSM) is saved. Second, the PSSM is used as input to the search of the SCOP sequences, which is run for at most 10 iterations. The e-value cut-off for including a sequence in PSSM generation is $10^{-10}$. We identify as known homologs of a SCOP domain those sequences that are similar to that domain with an e-value better than $10^{-3}$. If two or more SCOP domains with different SCOP family assignments pass this criterion, the sequence is designated as a homolog of the domain with the lowest e-value.
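The assignment rule at the end of this step is simple enough to state in code. The sketch below assumes that the search hits have already been parsed into (domain, family, e-value) tuples; that tuple layout, the function name and the example identifiers are hypothetical.

```python
def assign_scop_homolog(hits, cutoff=1e-3):
    """Return the hit a query is designated a homolog of, or None.

    hits is a list of (domain_id, family, e_value) tuples. Only hits with an
    e-value better (smaller) than the cutoff count; among those, the hit with
    the lowest e-value wins.
    """
    significant = [hit for hit in hits if hit[2] < cutoff]
    if not significant:
        return None
    return min(significant, key=lambda hit: hit[2])


# Made-up hits for one query sequence.
example_hits = [
    ("domain_1", "family_A", 2e-5),
    ("domain_2", "family_B", 4e-9),
    ("domain_3", "family_C", 0.3),
]
assigned = assign_scop_homolog(example_hits)
```

In this example two domains from different families pass the cut-off, and the query is assigned to `domain_2` because its e-value is lower.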
Sequence clustering
As a method for clustering sequences into functional superfamily groups we use the CLUSTAL multiple alignment program (Higgins et al., 1994). First, a multiple alignment of known homologs, structural analogs and candidate sequences is generated using CLUSTAL. Secondly, we calculate the pairwise p-distances among the set of proteins and create a neighbor-joining tree using the MEGA2 software (Kumar et al., 2001). We use the internal branch test of phylogeny to generate a robust clustering tree (Nei and Kumar, 2000). Thirdly, we identify sub-trees whose members among the known homologs all belong to the same superfamily and which also contain sequences from the new candidate set. The most distantly related known members of a superfamily delimit the superfamily sub-tree. The candidate sequences from the same superfamily sub-tree are designated as new members of the superfamily. We have also considered a bootstrap test of phylogeny to generate trees (Kumar et al., 2001). We have found that the bootstrap method clusters only the sequences with high sequence similarity, the known superfamily members (Nei and Kumar, 2000). Each superfamily set was split into many more sub-trees, and only three new sequences were clustered with the known members of superfamilies with the bootstrap test. Those sequences are indicated in Table I.
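The p-distance used as input to the neighbor-joining step is the proportion of aligned positions at which two sequences differ. A minimal sketch is given below. It assumes that columns containing a gap in either sequence are skipped (pairwise deletion), which is one of several possible gap treatments and not necessarily the setting used here; the function names and toy sequences are hypothetical.

```python
def p_distance(a, b, gap="-"):
    """Proportion of differing residues over aligned, gap-free columns."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """All pairwise p-distances for a dict of name -> aligned sequence."""
    names = list(alignment)
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m in names for n in names}


# Toy alignment rows (made up); real input would be the CLUSTAL alignment.
toy_alignment = {"seq1": "MKV-L", "seq2": "MRVAL", "seq3": "MKVAL"}
distances = distance_matrix(toy_alignment)
```

For two sequences compared over the same columns, the p-distance is one minus the fractional sequence identity, so a 14% identity corresponds to a p-distance of 0.86.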
Results and discussion |
---|
Each of the 8976 sequences was tested for structural compatibility with the ubiquitin fold. The posterior probability of each fold was calculated and we selected 134 proteins that had posterior probabilities >0.5, suggesting a ubiquitin fold. Following the process shown in Figure 1, we identified 11 of those proteins as false positives since they were similar by sequence comparison to proteins with a different structure. This corresponds to a false-positive rate of about 8% (11/134) for the structure recognition algorithm used here. Since the initial set of 8976 query proteins was selected blindly, regardless of whether a functional annotation for a gene was present or not, we believe it is a fair assessment of the false-positive rate. Thirty-three proteins were identified as sequence homologs of ubiquitin-like proteins, i.e. proteins that are known to have a similar structure. Among those there are 12 sequences similar to the MoaD/ThiS superfamily, eight sequences similar to the ubiquitin-like superfamily (not counting ubiquitin itself) and 13 similar to the 2Fe–2S ferredoxin-like superfamily. The specific sequences and their SCOP family association through PSI-BLAST sequence similarity are listed in Table I.
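The counts in this paragraph can be checked with a few lines of arithmetic. The variable names below are ours; the numbers are those quoted in the text.

```python
candidates = 134       # posterior probability of the ubiquitin fold > 0.5
false_positives = 11   # sequence-similar to proteins with a different structure
known_homologs = {
    "MoaD/ThiS": 12,
    "ubiquitin-like": 8,
    "2Fe-2S ferredoxin-like": 13,
}

false_positive_rate = false_positives / candidates
total_known = sum(known_homologs.values())
remaining = candidates - false_positives - total_known
```

The three superfamily counts add up to the 33 known homologs, and the proteins left after removing false positives and known homologs number 90, which is the figure quoted in the Conclusions.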
The application of the process outlined in Figure 1 identified an additional 29 ubiquitin-like proteins. Figure 2 shows the sub-trees corresponding to different superfamilies that were identified by the clustering procedure described in Materials and methods. The proteins that belong to each identified sub-tree are listed in Table I. Only one protein, from C.elegans, was classified into the ubiquitin-like superfamily. Twenty-one proteins were classified as members of the MoaD/ThiS superfamily. Seven proteins were classified as members of the 2Fe–2S ferredoxins.
Figure 3 shows the alignment of MTH1743 (1jsb), aq_025a, AF0737, TVN0569, 1fmaD and ubiquitin (1ubi) as inferred from the structural alignment (Holm and Sander, 1996) between 1ubi, 1jsb and 1fmaD and from the alignment of the ThiS and MoaD families. We used pairwise structural alignments between 1ubi–1jsb, 1ubi–1fmaD and 1jsb–1fmaD and checked that they agree among all three structures. These pairwise structural alignments were progressively combined to generate the alignment in Figure 3. We have checked that an alternative structural alignment method, CE (Shindyalov and Bourne, 1998), and the use of a structure of the ThiS protein (1f0z) produce an alignment identical to that in Figure 3. The MTH1743, AF0737 and aq_025a proteins are, respectively, 19% (13/70), 14% (10/69) and 13% (9/67) identical to ubiquitin, where the numbers in parentheses indicate the number of identities and the length of the protein. The observation that prompted us to focus on those three genes was that ubiquitin's Lys48, which is involved in the covalent assembly of poly-ubiquitin chains, is in a similar three-dimensional position to Lys42 in AF0737, Lys47 in MTH1743 (1jsb) and Lys38 in aq_025a (see Figure 4). Lys48 in ubiquitin occupies a structural position at the end of a two-residue-long tight turn between strands three and four. According to the alignment in Figure 3, Lys47 of 1jsb (and the corresponding lysines in AF0737 and aq_025a) is positioned just at the start of the same tight turn. Among the 24 members of COG02104, only AF0737, MTH1743 and aq_025a have Lys in this position. In the remaining 21 ThiS-COG sequences the closest lysine is at least five residues away. Lys42 in AF0737 is predicted to occupy a structural position that would permit participation in a covalent bond analogous to the one observed in poly-ubiquitin (see Figure 4). Additionally, all three sequences have the terminal Gly–Gly peptide, where activation of the protein occurs and which is conserved in a number of ubiquitin-related proteins. However, the Gly–Gly conservation is observed in the whole ThiS family. A similar investigation of the MoaD family (COG1977) found only one member of this family, the TVN0569 gene, that has similar sequence features. TVN0569 has only 11% (10/90) identity with ubiquitin and the Bayesian fold recognition method does not find it to be compatible with any known structural fold (including the MoaD structure). Moreover, the structure of MoaD itself exhibits a higher degree of structural difference from ubiquitin than the ThiS structure does (see Figure 3). Thus, the archaeal genes AF0737, MTH1743 and aq_025a bear a stronger overall resemblance to ubiquitin.
In the Introduction we have noted that discrimination between analogy and homology is a difficult problem. Thus, it is still possible that the AF0737, MTH1743 and aq_025a genes are just structural analogs of ubiquitin since they are identified as members of the ThiS family. However, our sequence and structure alignment analysis supports the hypothesis that AF0737, MTH1743 and aq_025a genes are the best candidates for ubiquitin homologs in prokaryotes.
Conclusions |
---|
Using our search method, in addition to known homologs, we have identified 90 small proteins from the sequenced genomes of Archaea, Bacteria, C.elegans and A.thaliana that are predicted to be similar in structure. Among those 90, we have identified 29 unannotated proteins as belonging to various ubiquitin-like protein superfamilies. Three of those proteins are in eukaryotes, six are in Archaea and 20 are bacterial proteins. For those 29 proteins, a probable homolog cannot be identified by sequence comparison. One C.elegans protein was classified as a member of the ubiquitin-like superfamily. Seven sequences were identified as 2Fe–2S ferredoxins and 21 were identified as members of the MoaD/ThiS superfamily. Further analysis of the structural alignments and of the alignments of the MoaD and ThiS COGs led us to select three genes as coding for ubiquitin-related proteins. Given the conservation of the structural position of a functionally important lysine, we propose that the archaeal proteins AF0737, MTH1743 and aq_025a are distant homologs of ubiquitin. Other proteins identified by this method are more likely to be just structural analogs. Given the low false-positive rate (8%) of the Bayesian fold recognition method and the further elimination of those false positives, we believe that the newly identified set of distant homologs of the ubiquitin-like, ferredoxin and MoaD/ThiS superfamilies is a very promising set of candidate proteins for the experimental confirmation of predicted functions.
The method proposed here provides a new means for distant homolog identification where methods based only on sequence similarity do not apply. This method is not proposed as a replacement for the sequence similarity searches but is helpful in situations where no homolog can be found using sequence comparison, as is often the case for short proteins. The identification of the potential homolog provides a starting point for the further analysis of the sequence and structure derived conservation that may give additional evidence supporting the common ancestry hypothesis. Alignments to a homolog with a known structure need to be inspected for the conservation of the core structural elements. The sequence similarity that is inferred from the structural alignment additionally supports the functional inference. Even more meaningful is the conserved three-dimensional location of functionally important residues. The method presented here can be applied to re-examine homology among many proteins shorter than 100 amino acids. For such short proteins the standard sequence similarity analysis often does not provide statistically significant predictions to assign homology.
References |
---|
Altschul,S.F., Madden,T., Schaffer,A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.
Bienkowska,J.R., Yu,L., Rogers,R.G.,Jr, Zarakhovich,S. and Smith,T.F. (2000) Proteins, 40, 451–462.
Bienkowska,J.R., He,H. and Smith,T.F. (2001) IEEE Intell. Syst. Biol., 16, 21–25.
CASP5 (2002) Critical Assessment of Protein Structure Prediction 5, Asilomar Conference Center, December 2002, Asilomar, CA, USA. http://predictioncenter.llnl.gov/casp5/
Chandonia,J.M., Walker,N.S., Lo Conte,L., Koehl,P., Levitt,M. and Brenner,S.E. (2002) Nucleic Acids Res., 30, 260–263.
Hadley,C. and Jones,D.T. (1999) Structure, 7, 1099–1112.
Higgins,D., Thompson,J., Gibson,T., Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.
Hochstrasser,M. (2000) Nat. Cell Biol., 2, 153–157.
Holm,L. and Sander,C. (1996) Science, 273, 595–560.
Jones,D.T. (1999) GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol., 287, 797–815.
Karlin,S. and Altschul,S.F. (1993) Proc. Natl Acad. Sci. USA, 90, 5873–5877.
Karplus,K. et al. (1999) Predicting protein structure using only sequence information. Proteins, Suppl. 3, 121–125.
Kumar,S., Tamura,K., Jakobsen,I.B. and Nei,M. (2001) Bioinformatics, 17, 1244–1245.
Lathrop,R.H. et al. (1998) A Bayes-optimal sequence-structure theory that unifies protein sequence-structure recognition and alignment. Bull. Math. Biol., 60, 1039–1071.
Murzin,A., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247, 536–540.
Nei,M. and Kumar,S. (2000) Molecular Evolution and Phylogenetics. Oxford University Press, New York.
Rudolph,M.J., Wuebbens,M.M., Rajagopalan,K.V. and Schindelin,H. (2000) Nat. Struct. Biol., 8, 42.
Rychlewski,L. et al. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci., 9, 232–241.
Shindyalov,I.N. and Bourne,P.E. (1998) Protein Eng., 11, 739–747.
Stultz,C.M., White,J.V. and Smith,T.F. (1993) Protein Sci., 2, 305–314.
Tatusov,R.L. et al. (2001) Nucleic Acids Res., 29, 22–28.
Wang,C., Xi,J., Begley,T.P. and Nicholson,L.K. (2000) Nat. Struct. Biol., 8, 47.
White,J.V. (1988) In Bayesian Analysis of Time Series and Dynamic Models. Marcel Dekker, New York, pp. 255–283.
Yee,A. et al. (2002) Proc. Natl Acad. Sci. USA, 99, 1825.
Received December 1, 2002; revised October 22, 2003; accepted October 24, 2003