A search method for homologs of small proteins. Ubiquitin-like proteins in prokaryotic cells?

Jadwiga R. Bienkowska1,3,4, Hyman Hartman2 and Temple F. Smith3

1Serono Reproductive Biology Institute, One Technology Place, Rockland, MA 02370, 2Department of Biology, MIT, Mass Avenue, Cambridge, MA 02139 and 3BMERC, Boston University, 36 Cummington Street, Boston, MA 02215, USA

4 To whom correspondence should be addressed. e-mail: Jadwiga.Bienkowska{at}serono.com


    Abstract
 
The question of protein homology versus analogy arises when proteins share a common function or a common structural fold without any statistically significant amino acid sequence similarity. Even when two or more proteins lack similar sequences, they are assumed to be homologs, descended from a common ancestor, if they share a common fold and the same or a closely related function. The problem of homolog identification is compounded for proteins of 100 or fewer amino acids. This is due to the limited number of basic single-domain folds and to the likelihood of identifying sequence similarity by chance. The latter arises from two conditions: first, any search of the currently very large protein database is likely to identify short regions of chance match; secondly, a direct sequence comparison among a small set of short proteins sharing a similar fold can detect many similar patterns of hydrophobicity even if the proteins do not descend from a common ancestor. In an effort to identify distant homologs of the many ubiquitin proteins, we have developed a combined structure and sequence similarity approach that attempts to overcome these limitations. This approach results in the identification of 90 probable ubiquitin-related proteins, including examples from the two prokaryotic domains of life, Archaea and Bacteria.

Keywords: homologs/search method/small proteins/ubiquitin-like proteins


    Introduction
 
The common ancestry of proteins has traditionally been proposed for those showing significant amino acid sequence similarity. The key issue is the definition of a ‘significant similarity’, which is usually defined in terms of the probability of finding by chance a sequence with the same or greater degree of similarity. In cases where the chance probability is high but the function and structure are similar, homology or common ancestry is still assumed. For many relatively short proteins of 100 amino acids or fewer, establishing significant sequence similarity often becomes problematic since generally we do not have complete experimentally validated structural or functional information. For proteins with a catalytic function defined by an active site, only a minimal overall sequence similarity is required to support homology if it is associated with the conservation of a unique set of key amino acids embedded in a similar sequence context, such as a general hydrophobicity pattern. However, for short proteins having a less clearly defined functional or active site, the problem remains. What further complicates the homology assessment for short proteins is the fact that only a limited number of single-domain structural folds is available. Thus, one can anticipate that many non-homologs will show general sequence pattern similarities. For globular soluble proteins these similarities include patterns of hydrophobicity and the spacing between amino acids with a high propensity for turns. Thus, without full structural and functional data, the identification of distantly related proteins of short length is still a major challenge. Here we present a method that attempts to overcome these limitations and its application to the search for probable homologs of a well-studied family, the ubiquitins.

For small proteins, about 100 amino acids long, the statistical significance of sequence similarity is limited by the large size of the protein database and the short length of the sequence. The expectation (e-value) for finding a random sequence match in a database takes into account the size of the database, and for short sequences (small proteins) the probability of a random match is quite high. A BLAST search (with default parameters) for ubiquitin homologs in the microbial database at NCBI returns only insignificant hits with e-values >1.4, and with a growing database this number will increase. Homolog detection methods based on sequence similarity use search algorithms such as BLAST, FASTA or Smith–Waterman that maximize the sequence similarity scores. The final output of a search is a set of optimal sequence alignments assessed by their statistical significance (Karlin and Altschul, 1993). A consequence of the score maximization is that aligning sequences of proteins with dissimilar structures often leads to sequence identities as high as 25–30% (Abagyan and Batalov, 1997). On the other hand, when similar structures of unrelated proteins are aligned, the sequence identity calculated for the structural alignment is usually below 10%. Since structure is evolutionarily more conserved than sequence, a conservation of structurally equivalent residues obtained from a structural alignment is more meaningful than a conservation of residues obtained from a sequence similarity alignment. For distant homologs the evolutionary relationship often cannot be confidently identified by sequence similarity searches, but the structural alignment can show sequence conservations that support the hypothesis of a common evolutionary origin. Additional inspection of the alignment may indicate that functionally important residues are conserved, thus giving credence to the common origin hypothesis.
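The dependence of the e-value on database size and query length can be made explicit with the standard Karlin–Altschul expression for local alignment scores. It is quoted here only as a reminder; the symbols m, n, S, K and λ are introduced for this illustration and are not used elsewhere in the paper.

```latex
E = K \, m \, n \, e^{-\lambda S}
```

Here m is the length of the query, n is the total length of the database, S is the raw alignment score, and K and λ are parameters determined by the scoring system. For a fixed score S the expectation grows in proportion to the database length n, while the score that a short query can attain is bounded by its length.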

To address the challenge of identifying homologs of small proteins we have designed a process of sequence information analysis that combines structure and sequence comparison methods. Figure 1 outlines the steps and the information flow in this process. The process is designed to maximize the detection of proteins that are likely to have a function similar to the target protein family and to minimize the number of false positives in the final result. It starts from the identification of all candidate proteins that are likely to be similar in structure to the target family and proceeds by using sequence comparison methods to eliminate false positives and to identify similar protein sequences. The last step classifies the identified candidate proteins into functional superfamily members and those that are just structural analogs. We have implemented this general approach by using specific methods for structure and sequence comparison: the Bayesian fold recognition (Bienkowska et al., 2000, 2001) and the PSI-BLAST algorithms (Altschul et al., 1997). The last classification step uses the CLUSTAL sequence alignment (Higgins et al., 1994). The description of the specific methods used in this investigation can be found in Materials and methods. However, other combinations of structure recognition, sequence similarity search and alignment methods could be used instead, for example different fold recognition methods: GenTHREADER (Jones, 1999), FFAS (Rychlewski et al., 2000) and SAM99 (Karplus et al., 1999). Other fold recognition methods can be found on the site of the most recent CASP5 protein structure prediction contest (CASP5).



 
Fig. 1. Flow chart of the process of information collection and analysis. Thick square frames indicate three major steps in the process: (1) evaluating sequence compatibility with the target structure; (2) evaluating sequence similarity with functional domains; and (3) clustering known homologs, analogs and new sequences into sequence similarity groups. Numbers of sequences considered during the search for ubiquitin-like proteins are indicated in parentheses. The ubiquitin sequence was not counted.

 
Ubiquitin is a small protein with 76 amino acids (not including the pre-protein peptide) in its core polypeptide chain. It is involved in the degradation of proteins in eukaryotic cells. Ubiquitin’s structure has a rather common architecture of a small alpha and beta roll, with a five-strand beta sheet wrapped around a 12 residue long alpha helix. The ubiquitin sequence is highly conserved from fungi to mammals but ubiquitin itself is found only in eukaryotic cells (Eukarya) and not in prokaryotic cells (Archaea and Bacteria). There are, however, proteins found in Archaea and Bacteria that have a three-dimensional structural fold similar to ubiquitin (Hochstrasser, 2000; Rudolph et al., 2000; Wang et al., 2000).

Ubiquitin and the ubiquitin-related proteins are essential for a number of cellular mechanisms (Hochstrasser, 2000). In its classical role, ubiquitin serves as a tag for proteins that are destined for degradation by the proteasome. The covalent attachment of ubiquitin directs the modified protein to the proteasome. This role appears exclusive to eukaryotes and is not known in Archaea or Bacteria. A related SUMO protein is used in yeast for tagging proteins and directing them to discrete sites within the nucleus. Like ubiquitin, SUMO has many diverse targets, among them the transcription factors NF-κB/Rel and p53. Another ubiquitin-related protein in yeast, RUB1 (related to ubiquitin 1), targets only one family, the cullins. Other proteins (like ribosomal protein L30) have a ubiquitin-like domain built into the protein structure itself, perhaps used to direct them to the ribosomal assembly. These observations suggest that ubiquitin-related tagging systems are a widely used mechanism for directing proteins to specific locations within the eukaryotic cell.

In Archaea and Bacteria the recently solved structures of the prokaryotic sulfur carrier proteins ThiS and MoaD exhibit a structural similarity to ubiquitin, with a sequence identity with ubiquitin of only 14% according to the structural alignment (Rudolph et al., 2000; Wang et al., 2000). The sulfur transfer mechanism used by ThiS and MoaD is also similar to that occurring during ubiquitination. Another protein, HesB from the cyanobacterium Anabaena, has also been proposed as a ubiquitin analog since proteins encoded in the same operon are related to the ubiquitin-activating enzyme and ferredoxin. However, the HesB structure has not been determined and the only sequence-based support for this conjecture is the observation that in some species HesB terminates with a Gly–Gly peptide, as ubiquitin does (Hochstrasser, 2000). The question is: are the ThiS and MoaD proteins homologs or analogs of ubiquitin? Are there other ubiquitin-related proteins in Archaea and Bacteria that also play important roles in the organization of the prokaryotic cell? The goal of this investigation was to find as many potentially ubiquitin-related proteins as possible that cannot be easily identified by sequence similarity alone. Those proteins are candidates for investigating whether they have structures similar to ubiquitin and/or play similar cellular roles.

Given ubiquitin’s small size, it is difficult to detect distant homologs using sequence similarity searches alone. To search for new ubiquitin-related proteins we have applied a method outlined in Figure 1 combining fold recognition with sequence similarity. Our method concentrates on the detection of entire structural domains but not on the detection of a domain embedded in the context of a multi-domain protein. Thus, proteins that contain a ubiquitin-like domain among other components or poly-ubiquitin chains were not the subject of this study. Support for any implied homology was then investigated by looking for specific functional residue conservations implied by structural alignments.


    Materials and methods
 
In this section, we describe the specific structure similarity, sequence similarity and sequence clustering methods used here as the three major steps in the information analysis process described in Figure 1.

Structural similarity

In this study, we have used a Bayesian fold recognition method (Bienkowska et al., 2000, 2001) as the first step of the process shown in Figure 1. In this method, protein domain structures are modeled by discrete state-space models (DSMs) as described before (Bienkowska et al., 2000, 2001). DSMs are represented mathematically as hidden Markov models (Stultz et al., 1993). A distinct DSM is automatically built for a homologous protein family, when at least one member of the family has a known three-dimensional structure. Such a model characterizes a structural fold common to a protein family including minor structural variation. The model also encodes a sequence pattern common to the whole family. The SCOP (Murzin et al., 1995) functional domain classification of proteins is used for the identification of autonomous functional domains/families. Models of all functional families defined in SCOP constitute the DSM library of all known folds. We have used the 1.61 release of SCOP to construct the DSM library for all domains with <95% identities (Chandonia et al., 2002). For the 1.61 release of SCOP the DSM library consists of 21 016 models. Those models represent 1517 families and 589 folds from SCOP. Our library of DSMs excludes PDB entries classified by SCOP as membrane, coiled-coil, peptides, low resolution or designed structures.

The search algorithm assesses the match between the query sequence and the model of the family by calculating the total probability of observing the sequence given a model. The total probability is not a maximum of some sequence-to-model similarity score but is simply a sum of probabilities of all sequence-to-model alignments. Previously, we have shown that the total alignment probability is a better measure of fold recognition than the optimal alignment probability (Bienkowska et al., 2000). In the case of DSMs, the optimal alignment is equivalent to a globally optimal alignment of a sequence to a model. Since a subset of the DSMs library encodes sequence profiles (Bienkowska et al., 2001), the search against such models with the global optimal alignment probability as a similarity measure corresponds to the search of sequence profiles with a global alignment score.
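The difference between the total probability and the optimal alignment probability can be illustrated on a toy hidden Markov model. The sketch below is not the DSM implementation or the filtering algorithm used in this study; it uses the textbook forward and Viterbi recursions, and the two-state model, its state names and all of its numbers are invented for illustration.

```python
def total_and_best_path_probability(observations, initial, transition, emission):
    """Return (total, best) probabilities of `observations` under a toy HMM.

    total sums over every state path (forward recursion);
    best keeps only the single most probable path (Viterbi recursion).
    """
    states = list(initial)
    first = observations[0]
    total = {s: initial[s] * emission[s][first] for s in states}
    best = dict(total)
    for symbol in observations[1:]:
        # Both updates read the previous column before it is overwritten.
        total = {
            t: emission[t][symbol] * sum(total[s] * transition[s][t] for s in states)
            for t in states
        }
        best = {
            t: emission[t][symbol] * max(best[s] * transition[s][t] for s in states)
            for t in states
        }
    return sum(total.values()), max(best.values())


# Hypothetical model: "core" prefers hydrophobic (h), "loop" prefers polar (p).
INITIAL = {"core": 0.5, "loop": 0.5}
TRANSITION = {"core": {"core": 0.7, "loop": 0.3}, "loop": {"core": 0.4, "loop": 0.6}}
EMISSION = {"core": {"h": 0.8, "p": 0.2}, "loop": {"h": 0.3, "p": 0.7}}

total, best = total_and_best_path_probability("hp", INITIAL, TRANSITION, EMISSION)
print(f"total = {total:.3f}, best path = {best:.3f}")
```

The total is never smaller than the best-path value, because the best path is one of the terms in the sum.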

The statistical significance of the match between the query protein sequence and a model is measured by the posterior probability of the model given the sequence relative to the large library of alternative models (Lathrop et al., 1998; Bienkowska et al., 2000, 2001). According to Bayes' rule, the posterior probability of a fold is given by the total probability of the sequence given the fold and the prior probability of the fold:

$$P(\mathrm{fold}_k \mid \mathrm{seq}) = \frac{P(\mathrm{seq} \mid \mathrm{fold}_k)\,P(\mathrm{fold}_k)}{\sum_{j=1}^{N_f} P(\mathrm{seq} \mid \mathrm{fold}_j)\,P(\mathrm{fold}_j)} \qquad (1)$$

where $N_f$ is the number of folds represented in the library. We select only the model with the highest likelihood score from each fold and assign the sequence probability given a fold model by:

$$P(\mathrm{seq} \mid \mathrm{fold}) = \max_{M_i \in \mathrm{fold}} P(\mathrm{seq} \mid M_i) \qquad (2)$$

where $M_i$ are the models from the library. We assign fold priors uniformly over the folds. The total probability $P(\mathrm{seq} \mid M_i)$ is calculated using the filtering algorithm developed by Jim White (White, 1988). We use a binary decision approach and recognize the sequence as compatible with a structural fold if the posterior probability of the fold (Equation 1) is >0.5.
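The posterior calculation described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used in this study: the per-model log-likelihoods are taken as given, every fold in the library is assumed to have at least one scored model, and the model names, fold names and numbers are invented.

```python
import math
from collections import defaultdict


def fold_posteriors(log_likelihoods, model_to_fold):
    """Posterior probability of each fold given per-model log P(seq | model).

    log_likelihoods: {model_id: log-likelihood}; model_to_fold: {model_id: fold_id}.
    """
    # Keep only the best-scoring model of each fold.
    best = defaultdict(lambda: -math.inf)
    for model, log_lik in log_likelihoods.items():
        fold = model_to_fold[model]
        best[fold] = max(best[fold], log_lik)
    # Uniform fold priors cancel, leaving a normalised exponential of the
    # per-fold log-likelihoods; subtracting the maximum avoids underflow.
    shift = max(best.values())
    weights = {fold: math.exp(log_lik - shift) for fold, log_lik in best.items()}
    norm = sum(weights.values())
    return {fold: weight / norm for fold, weight in weights.items()}


def compatible_folds(posteriors, threshold=0.5):
    """Folds whose posterior exceeds the binary decision threshold."""
    return [fold for fold, p in posteriors.items() if p > threshold]


# Invented scores for one query sequence against four hypothetical models.
LOG_LIK = {"ubq_a": -100.0, "ubq_b": -98.0, "fd_a": -101.0, "sh3_a": -105.0}
MODEL_TO_FOLD = {
    "ubq_a": "ubiquitin-like",
    "ubq_b": "ubiquitin-like",
    "fd_a": "ferredoxin-like",
    "sh3_a": "SH3-like",
}

posteriors = fold_posteriors(LOG_LIK, MODEL_TO_FOLD)
print(compatible_folds(posteriors))
```

Because the posteriors sum to one, at most one fold can exceed the 0.5 threshold for a given sequence.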

Sequence similarity

As a sequence comparison method we use PSI-BLAST (Altschul et al., 1997). Each candidate homolog sequence is a query in the search of the database of SCOP sequences as provided by the ASTRAL database (Chandonia et al., 2002). The search is run in two steps. First, the PSI-BLAST search is run for five iterations against the GenBank non-redundant protein database and the resulting position-specific scoring matrix (PSSM) is saved. Secondly, the saved PSSM is used as an input in the search of the SCOP sequences, run for at most 10 iterations. The e-value cut-off for inclusion of a sequence in PSSM generation is 10^-10. We identify as ‘known homologs’ of a SCOP domain those sequences that are similar to that domain with an e-value better than 10^-3. If two or more SCOP domains with different SCOP family assignments pass this criterion, the sequence is designated as a homolog of the domain with the lowest e-value.
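The rule for designating a ‘known homolog’ can be written as a small selection function. This is a minimal sketch that assumes the PSI-BLAST hits have already been parsed into tuples; the tuple layout, the function name and the example hits are hypothetical and do not correspond to any PSI-BLAST output format.

```python
def assign_known_homolog(hits, evalue_cutoff=1e-3):
    """Pick the SCOP domain a query is designated a homolog of, or None.

    hits: iterable of (domain_id, scop_family, evalue) tuples for one query.
    Only hits with an e-value better (lower) than the cut-off are considered;
    among those, the hit with the lowest e-value wins.
    """
    passing = [hit for hit in hits if hit[2] < evalue_cutoff]
    if not passing:
        return None
    return min(passing, key=lambda hit: hit[2])


# Invented hits for a single query sequence.
example_hits = [
    ("domain_A", "MoaD/ThiS", 2e-7),
    ("domain_B", "2Fe-2S ferredoxin-like", 4e-4),
    ("domain_C", "ubiquitin-like", 0.8),
]
print(assign_known_homolog(example_hits))
```

A query for which the function returns None has no known homolog by this criterion and remains a candidate for the clustering step.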

Sequence clustering

As a method for clustering sequences into functional superfamily groups we use the CLUSTAL multiple alignment program (Higgins et al., 1994). First, a multiple alignment of known homologs, structural analogs and candidate sequences is generated using CLUSTAL. Secondly, we calculate the pairwise p-distances among the set of proteins and create a neighbor-joining tree using the MEGA2 software (Kumar et al., 2001). We use the internal branch test of phylogeny to generate a robust clustering tree (Nei and Kumar, 2000). Thirdly, we identify sub-trees that have as members only the same superfamily proteins among the known homologs and sequences from the new candidates set. The most distantly related known members of a superfamily delimit the superfamily sub-tree. The candidate sequences from the same superfamily sub-tree are designated as new members of the superfamily. We have also considered a bootstrap test for phylogeny to generate trees (Kumar et al., 2001). We have found that the bootstrap method clusters only the sequences with high sequence similarity, the known superfamily members (Nei and Kumar, 2000). Each superfamily set was split into many more sub-trees and only three new sequences were clustered with the known members of superfamilies with the bootstrap test. Those sequences are indicated in Table I.
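For readers unfamiliar with the p-distance, the sketch below shows how it can be computed from two rows of a multiple alignment. It is not the MEGA2 implementation; skipping every column in which either sequence has a gap (pairwise deletion) is an assumption made here, and the example sequences are invented.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing residues among the compared alignment columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumed pairwise deletion of gapped columns
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(aligned):
    """All pairwise p-distances for a {name: aligned_sequence} mapping."""
    names = sorted(aligned)
    return {(x, y): p_distance(aligned[x], aligned[y]) for x in names for y in names}


toy_alignment = {"seq1": "MQIFVK", "seq2": "MQIYVK", "seq3": "MQ-FVR"}
matrix = distance_matrix(toy_alignment)
print(matrix[("seq1", "seq3")])
```

A matrix of this kind is the input from which a neighbor-joining tree is built.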


 
Table I. The classification of proteins into different superfamilies with the ubiquitin-like fold using the method described in Figure 1
 

    Results and discussion
 
In the search for the ubiquitin-related proteins we have focused our attention on all archaeal, bacterial and eukaryotic sequences that range in length from 50 to 100 amino acids. We have collected 1467 archaeal, 6182 bacterial and 1327 genes from yeast (Saccharomyces cerevisiae), worm (Caenorhabditis elegans) and plant (Arabidopsis thaliana). Our analysis is restricted to genes shorter than 100 amino acids, thus yeast genes YKR094C and YIL148W (137 amino acids long) coding for the ubiquitin pre-protein precursor as well as other poly-ubiquitin genes and multi-domain genes that are known to have ubiquitin-like domains were not the subject of our analysis.

Each of the 8976 sequences was tested for structural compatibility with the ubiquitin fold. The posterior probability of each fold was calculated and we have selected 134 proteins that had posterior probabilities >0.5, suggesting a ubiquitin fold. Following the process shown in Figure 1, we have identified 11 of those proteins as false positives since they were similar by sequence comparison to proteins with a different structure. This corresponds to a false-positive rate of about 8% (11/134) for the structure recognition algorithm used here. Since the initial set of 8976 query proteins has been selected blindly, regardless of whether a functional annotation for a gene was present or not, we believe it is a fair assessment of the false-positive rate. Thirty-three proteins were identified as sequence homologs of ubiquitin-like proteins, i.e. proteins that are known to have a similar structure. Among those there are 12 sequences similar to the MoaD/ThiS superfamily, eight sequences similar to the ubiquitin-like superfamily (not counting ubiquitin itself) and 13 similar to the 2Fe–2S ferredoxin-like superfamily. The specific sequences and their SCOP family association through PSI-BLAST sequence similarity are listed in Table I.
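The counts quoted above can be tallied in a few lines. All numbers are taken from the text; reading the remainder after removing false positives and known homologs as the set of new candidates is our interpretation, and the variable names are chosen for this illustration.

```python
# Sequences collected for the search, by origin (counts from the text).
collected = {"archaeal": 1467, "bacterial": 6182, "eukaryotic": 1327}
total_queries = sum(collected.values())

fold_hits = 134          # posterior probability > 0.5 for the ubiquitin fold
false_positives = 11     # similar by sequence to a different structure
known_homologs = {
    "MoaD/ThiS": 12,
    "ubiquitin-like": 8,
    "2Fe-2S ferredoxin-like": 13,
}

false_positive_rate = false_positives / fold_hits
# Candidates left after removing false positives and known homologs.
remaining_candidates = fold_hits - false_positives - sum(known_homologs.values())

print(total_queries, remaining_candidates, f"{false_positive_rate:.1%}")
```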

The application of the process outlined in Figure 1 identified an additional 28 ubiquitin-like proteins. Figure 2 shows the subtrees corresponding to different superfamilies that were identified by the clustering procedure described in Materials and methods. The proteins that belong to each identified subtree are listed in Table I. Only one protein from C.elegans was classified into the ubiquitin-like superfamily. Twenty-one proteins were classified as members of the MoaD/ThiS superfamily. Seven proteins were classified as members of the 2Fe–2S ferredoxins.



 
Fig. 2. The neighbor-joining tree of candidate ubiquitin-like proteins generated with the internal branch test. The MoaD/ThiS superfamily subtrees are highlighted in red, the ubiquitin-like superfamily is highlighted in blue and the 2Fe–2S ferredoxin superfamily is highlighted in yellow. For each SCOP superfamily, subtrees have the numbers assigned as in Table I. The non-highlighted sequences are structural analogs.

 
Our investigation did not find any proteins in Archaea or Bacteria classified as members of the ubiquitin-like superfamily. Thus, we have investigated the MoaD/ThiS family members that were indicated before as possible ancestors of ubiquitin (Wang et al., 2000). The COGs COG1977 and COG2104 at NCBI define those two families by constructing sequence similarity groups using the COG procedure (Tatusov et al., 2001). Generalized sequence similarity searches such as the COG procedure are not free from false positives. One example of a likely false positive in COG2104 is the Cj1047c gene, which is consistently predicted as an all-alpha protein (in contrast with the mostly beta-sheet ThiS structure) by different fold recognition and secondary structure prediction methods (data not shown). We have examined whether structural alignments between ubiquitin and ubiquitin structural analogs indicate a conservation of residues important for the ubiquitin function. We have considered two functional features of the ubiquitin sequence: the C-terminal Gly–Gly dipeptide important for activation of proteins, and Lys48 involved in the ubiquitination. Recently, two members of the ThiS family, ThiS and MTH1743 (PDB codes 1f0zA and 1jsb), and MoaD protein from Escherichia coli (PDB code 1fmaD) had their structures determined (Rudolph et al., 2000; Wang et al., 2000; Yee et al., 2002). The analysis of the multiple sequence alignments within each sequence similarity group (COG) and of the structural alignment among known structures led us to the conclusion that there are only four archaeal sequences that have the functional features of ubiquitin conserved: aq_025a, AF0737, MTH1743 and TVN0569.

Figure 3 shows the alignment of MTH1743 (1jsb), aq_025a, AF0737, TVN0569, 1fmaD and ubiquitin (1ubi) as inferred from the structural alignment (Holm and Sander, 1996) between 1ubi, 1jsb, 1fmaD and the alignment of ThiS and MoaD families. We have used pairwise structural alignments between 1ubi–1jsb, 1ubi–1fmaD and 1jsb–1fmaD and checked that they agree among all three structures. These pairwise structural alignments were progressively combined to generate the alignment in Figure 3. We have checked that using an alternative method of structural alignment, CE (Shindyalov and Bourne, 1998), and using a structure of the ThiS protein (1f0z) produced an alignment identical to that in Figure 3. The MTH1743, AF0737 and aq_025a proteins are, respectively, 19% (13/70), 14% (10/69) and 13% (9/67) identical to ubiquitin, where the numbers in parentheses indicate the number of identities and the length of the protein. The observation that prompted us to focus on those three genes was that ubiquitin’s Lys48, involved in the covalent assembly of poly-ubiquitin chains, is in a similar three-dimensional position to Lys42 in AF0737, Lys47 in MTH1743 (1jsb) and Lys38 in aq_025a (see Figure 4). Lys48 in ubiquitin occupies a structural position at the end of a two residue long tight turn between strands three and four. According to the alignment in Figure 3, Lys47 of 1jsb (and the corresponding lysines in AF0737 and aq_025a) is positioned just at the start of the same tight turn. Among 24 members of COG2104, only AF0737, MTH1743 and aq_025a have Lys in this position. In the remaining 21 ThiS-COG sequences the closest lysine is at least five residues away. Lys42 in AF0737 is predicted to occupy a structural position that would permit participation in a covalent bond analogous to the one observed in poly-ubiquitin (see Figure 4).
Additionally, all three sequences have the terminal Gly–Gly peptide that is conserved in a number of ubiquitin-related proteins, where the activation of proteins occurs. However, the Gly–Gly conservation is observed in the whole ThiS family. A similar investigation of the MoaD family (COG1977) found that there is only one member of this family, the TVN0569 gene, that has similar sequence features. TVN0569 has only 11% (10/90) sequence identity with ubiquitin and the Bayesian fold recognition method does not find it to be compatible with any known structural fold (including the MoaD structure). Moreover, the structure of MoaD itself exhibits a higher degree of structural difference with ubiquitin than the ThiS structure does (see Figure 3). Thus, the archaeal genes AF0737, MTH1743 and aq_025a bear a stronger overall resemblance to ubiquitin.
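The check for a lysine at, or close to, a given alignment position can be expressed as a short function. This is only a sketch of the kind of test described above, not the procedure used to produce the reported counts; the function name, the handling of gapped columns and the example sequences are assumptions made for illustration.

```python
def lysine_distance(aligned_seq, column, gap="-"):
    """Smallest residue distance from alignment column `column` to a lysine.

    Distances are counted in residues of this sequence, so gap characters are
    skipped. A result of 0 means a Lys at the column itself (or, when the
    column is a gap in this sequence, at the next residue to its right).
    Returns None if the sequence contains no lysine.
    """
    # residues_before[c] = number of residues strictly left of column c.
    residues_before = []
    count = 0
    for char in aligned_seq:
        residues_before.append(count)
        if char != gap:
            count += 1
    target = residues_before[column]
    distances = [
        abs(residues_before[c] - target)
        for c, char in enumerate(aligned_seq)
        if char == "K"
    ]
    return min(distances) if distances else None


# Invented aligned fragments; column 8 plays the role of the reference position.
print(lysine_distance("MKVTIN-GKE", 8))
print(lysine_distance("MKVTINGAEE", 8))
```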



 
Fig. 3. The alignment of ubiquitin-like proteins to ubiquitin (PDB code 1ubi). The recently solved structures of ThiS protein MTH1743 (PDB code 1jsb) and MoaD protein (PDB code 1fmaD) were used. The multiple alignment is based on pairwise structural alignments between the 1jsb–1ubi and 1ubi–1fmaD structures. Pairwise structural alignments were used progressively to generate the alignment presented here. Two structural alignment methods were consulted, FSSP and CE, and both generate the same pairwise alignments. The 1ubi structure was then taken as a reference to generate a common alignment among the three structures. For proteins that have no known structure, aq_025a, AF0737 and TVN0569, we first generated an alignment within each family (ThiS and MoaD) using CLUSTAL; the alignment between those and the proteins with known structures (1jsb and 1fmaD, respectively) is represented above. The Lys48 of ubiquitin and the corresponding Lys42 in AF0737, Lys47 in 1jsb and Lys38 in aq_025a are in blue, as is the conserved Gly–Gly peptide. The secondary structure elements are highlighted green for β-strands, red for α-helices and magenta for short (four or five residues long) or 3₁₀-helices. We distinguish long α-helices from short and 3₁₀-helices since these secondary structure elements are often replaced by unstructured loops in homologs. For the bacterial and archaeal sequences aq_025a, AF0737, 1jsb, 1fmaD and TVN0569, the residues identical to those of ubiquitin are underlined.

 


 
Fig. 4. The alignment of ubiquitin (1ubi) and the MTH1743 structure (1jsbA). The ubiquitin structure is in a ribbon representation and the MTH1743 structure is represented as a backbone. Lys48 involved in ubiquitination, Phe45 of ubiquitin and Lys47 of the 1jsbA structure are indicated by ball-and-stick representation. According to the structural alignment, the Phe45 position is occupied by a lysine in the archaeal sequences aq_025a, AF0737 and MTH1743 (Lys47 in the 1jsb structure).

 
We have also investigated whether there is evidence at the codon level of a pressure for the conservation of functional amino acids at specific structural positions. The conserved Gly–Gly peptide is coded by different codons across the archaeal genes AF0737, MTH1743, aq_025a and ubiquitin. However, this is consistent with the observation that in other known ubiquitin-related proteins from Eukarya (like SUMO and RUB1) the Gly–Gly peptide is coded by different codons as well. On the other hand, the structurally conserved lysine is coded in all genes (AF0737, MTH1743, aq_025a, ubiquitin) by the same ‘AAG’ codon. There is only one amino acid that is conserved among all four proteins in the structural alignment (see Figure 3), the glutamate 16 in the 1ubi structure and the corresponding residues in archaeal proteins. This glutamate occupies a structural position at the end of the second beta-strand and in all genes is coded by the ‘GAA’ codon. To our knowledge it is not known if this glutamate plays any functional role. Even though there is a minimal conservation of residues across all four genes, the codon conservation supports our structurally based hypothesis.
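A codon-level comparison of this kind amounts to reading the triplet that encodes a chosen residue in each gene. The sketch below shows the bookkeeping only; the coding sequences are short invented strings, not the real genes, and whether the residue numbering starts at the initiator methionine is an assumption that would have to be checked for each gene.

```python
def codon_at(cds, residue_number):
    """Codon encoding the 1-based `residue_number` of coding sequence `cds`."""
    if residue_number < 1:
        raise ValueError("residue numbers start at 1")
    start = 3 * (residue_number - 1)
    codon = cds[start:start + 3].upper()
    if len(codon) != 3:
        raise ValueError("residue lies beyond the end of the coding sequence")
    return codon


def shared_codon(cds_by_gene, residue_by_gene):
    """Return (codons per gene, the common codon or None if they differ)."""
    codons = {
        gene: codon_at(cds_by_gene[gene], residue_by_gene[gene])
        for gene in cds_by_gene
    }
    unique = set(codons.values())
    return codons, (unique.pop() if len(unique) == 1 else None)


# Invented coding sequences: gene_a encodes M-K-G, gene_b encodes M-A-K-G.
toy_cds = {"gene_a": "ATGAAGGGT", "gene_b": "ATGGCTAAGGGA"}
lysine_positions = {"gene_a": 2, "gene_b": 3}
glycine_positions = {"gene_a": 3, "gene_b": 4}

print(shared_codon(toy_cds, lysine_positions))
print(shared_codon(toy_cds, glycine_positions))
```

In the toy data the lysine is encoded by the same codon in both genes while the glycine is not, which mirrors the kind of contrast described above.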

In the Introduction we have noted that discrimination between analogy and homology is a difficult problem. Thus, it is still possible that the AF0737, MTH1743 and aq_025a genes are just structural analogs of ubiquitin since they are identified as members of the ThiS family. However, our sequence and structure alignment analysis supports the hypothesis that AF0737, MTH1743 and aq_025a genes are the best candidates for ubiquitin homologs in prokaryotes.


    Conclusions
 
We have presented a method for distant homolog identification that uses structural fold recognition as a first step, followed by two sequence similarity assessments that eliminate possible false positives and assign a function to the identified sequences. First, the fold recognition method focuses the search on a set of proteins that are likely to have a similar structure. Further sequence similarity analysis allows clustering of those proteins into similarity groups. Subsequent analysis of those groups and similar sequences can suggest candidates for functional homologs.

Using our search method, in addition to known homologs, we have identified 90 small proteins from the sequenced genomes of Archaea, Bacteria, C.elegans and A.thaliana that are predicted to be similar in structure. Among those 90, we have identified 29 un-annotated proteins as belonging to various ubiquitin-like protein superfamilies. Three of those proteins are in eukaryotes, six are in Archaea and 20 are bacterial proteins. For those 29 proteins, a probable homolog cannot be identified by sequence comparison. One C.elegans protein was classified as a member of the ubiquitin-like superfamily. Seven sequences were identified as 2Fe–2S ferredoxins and 21 were identified as members of the MoaD/ThiS superfamily. Further analysis of the structural alignments and of the alignments of the MoaD and ThiS COGs led us to the selection of three genes as coding for ubiquitin-related proteins. Given the conservation of the structural position of the functionally important lysine, we propose that the archaeal proteins AF0737, MTH1743 and aq_025a are distant homologs of ubiquitin. Other proteins identified by this method are more likely to be just structural analogs. Given the low false-positive rate (8%) of the Bayesian fold recognition method and the subsequent elimination of those false positives, we believe that the newly identified set of distant homologs of ubiquitin-like, ferredoxin and MoaD/ThiS proteins is a very promising set of candidates for the experimental confirmation of predicted functions.

The method proposed here provides a new means for distant homolog identification where methods based only on sequence similarity do not apply. This method is not proposed as a replacement for the sequence similarity searches but is helpful in situations where no homolog can be found using sequence comparison, as is often the case for short proteins. The identification of the potential homolog provides a starting point for the further analysis of the sequence and structure derived conservation that may give additional evidence supporting the common ancestry hypothesis. Alignments to a homolog with a known structure need to be inspected for the conservation of the core structural elements. The sequence similarity that is inferred from the structural alignment additionally supports the functional inference. Even more meaningful is the conserved three-dimensional location of functionally important residues. The method presented here can be applied to re-examine homology among many proteins shorter than 100 amino acids. For such short proteins the standard sequence similarity analysis often does not provide statistically significant predictions to assign homology.


Received December 1, 2002; revised October 22, 2003; accepted October 24, 2003




