From GeneFormatics, Inc., 5830 Oberlin Drive, Suite 200, San Diego, CA 92121; and ¶ ActivX Biosciences, Inc., 11025 North Torrey Pines Road, Suite 120, La Jolla, CA 92037
ABSTRACT |
---|
Proteome analysis must be followed by the function annotation or characterization of each expressed protein, information that is at the core of biological understanding and is essential in the pharmaceutical industry for development of small molecule inhibitors. Many proteins identified by large-scale proteomics methods cannot be assigned a biochemical function. For example, a recent proteomics analysis of the rice proteome identified 2528 unique proteins in leaf, root, and seed tissue. Basic sequence-based approaches to functional classification of these proteins showed that the most abundant group (31.8%) belonged to a protein family of unknown function or exhibited low sequence identity to proteins of known function (12). Additional approaches are necessary to further determine the functional class and functional state of relevant components of the proteome.
Approaches aimed at functional analysis of proteomes are being developed. These include, for example, computational methods utilizing sequence comparison (13, 14), methods focused on functional site analysis (15–17), methods identifying protein-protein interactions (18), chemical proteomics approaches aimed at tagging functional sites on a large scale (19–22), and metabolomic methods (23). To overcome limitations of individual analyses and to provide a more accurate and precise functional analysis, we have combined synergistic computational and chemical proteomics approaches to fractionate the well-studied yeast proteome into functional subsets with high confidence.
In this work, we focused on the identification of serine hydrolases in yeast. Serine hydrolases are of interest because of their range of biological activities and because they are targets of several pharmaceutical agents. Serine hydrolases are present in all organisms and are active in diverse cellular compartments and functions. This class of enzymes includes proteases involved in the coagulation cascade (24); amidases responsible for the metabolism of endogenous signaling molecules (25); penicillin-binding proteins responsible for antibiotic sensitivity (26); and carboxylesterases involved in the metabolism of pharmaceuticals (27). Current drugs targeted against specific human serine hydrolases include Angiomax® for cardiovascular disease, Xenical® for obesity, and Aricept® and Cognex® for Alzheimer's disease, as well as drugs in development for diabetes, arthritis, and cancer. Serine hydrolases are highly regulated and often present in low abundance, characteristics that present significant challenges to current methods of proteomic analysis. In addition, serine hydrolase activity is exhibited by enzymes distributed across most International Union of Biochemistry and Molecular Biology Enzyme Classification (EC) classes and is found in a wide variety of tertiary structures (Fig. 1), enzymatic functions (Table I), and mechanisms. Active site diversity gives rise to several mechanisms that lower the pK of the key catalytic, nucleophilic serine. Both Ser-His-Asp catalytic triads and Ser-Lys catalytic dyads (28, 29), arranged in a particular three-dimensional configuration, can carry out serine hydrolase activity. Any method for proteome-scale analysis of serine hydrolases must adequately handle this mechanistic and structural diversity; thus, this system was chosen as a difficult challenge for our combined proteomics methods.
We independently applied structural and chemical proteomics technologies to identify yeast proteins that exhibit serine hydrolase function and then compared the results. Both of these function-based proteomics methods identify a large number of proteins. Fifteen serine hydrolase proteins are identified by both methods, and these are designated as high-confidence annotations. About half of these high-confidence identifications are known serine hydrolases, and about half are annotated here for the first time. Remarkably in this well-studied genome, the combined whole-proteome methodologies uncover a family of serine hydrolases in yeast not previously recognized, which we designate as Fsh (family of serine hydrolases). The results of this study demonstrate the utility of combining independently two complementary and synergistic function-based approaches to produce a more accurate analysis of complex proteomes.
EXPERIMENTAL PROCEDURES |
---|
Purification and LC-MS/MS of Labeled Proteins in the Yeast Proteome
To identify as many serine hydrolases as possible, yeast cultures were grown in four different media: ideal (YPD, 2% dextrose), aerobic oxidation (YP with 2% galactose and 0.5% lactate), anaerobic fermentation (YP with 3% ethanol), and sporulation (1% KOAc). After growth under each condition, the yeast cells were lysed, centrifuged first at 15,000 x g, then at 100,000 x g. For each growth condition, proteins from both fractions of the high-speed spin only were labeled as described above with either a biotin-containing or a tetramethylrhodamine-containing ABP. After the ABP labeling, the reactions were quenched by addition of solid urea (6 M final concentration), followed by sequential treatment with DTT (10 mM final concentration) and iodoacetamide (40 mM final concentration) to reduce and alkylate free cysteines, respectively. After gel filtration to remove urea, DTT, and iodoacetamide, the labeled samples were subjected to affinity chromatography using either avidin agarose (Sigma) or an anti-rhodamine monoclonal antibody-agarose (prepared at ActivX). The resins were washed with buffer containing 1% SDS. Eluted proteins were separated by one-dimensional SDS-PAGE, and labeled proteins were excised and in-gel digested with trypsin following standard protocols (38). Tryptic peptides were analyzed using a ThermoFinnigan (San Jose, CA) LCQ Deca XP and either Sequest or Mascot software, essentially as previously described (39). Results from the four growth conditions were combined. To control for the appearance of abundant proteins nonspecifically bound to the affinity matrices, parallel experiments were conducted wherein the ABP-labeling step was omitted. Proteins identified in the control experiments were subtracted from those identified in the ABP-labeling experiments.
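The control subtraction and pooling described above can be sketched as simple set operations. This is a minimal illustration, not the software used in the study: the function names and the protein lists are hypothetical, and the sketch assumes that subtraction is done per growth condition before pooling, which the text does not specify.

```python
def subtract_controls(labeled, control):
    """Proteins found in the ABP-labeled run but not in the no-probe control."""
    return set(labeled) - set(control)


def combine_conditions(per_condition):
    """Pool background-subtracted identifications across growth conditions."""
    combined = set()
    for labeled, control in per_condition.values():
        combined |= subtract_controls(labeled, control)
    return combined


# Hypothetical identifications, for illustration only (not data from the study).
runs = {
    "YPD": ({"Prb1", "Prc1", "Tdh3"}, {"Tdh3"}),
    "sporulation": ({"Kex1", "Prb1", "Pgk1"}, {"Pgk1"}),
}
specific_hits = combine_conditions(runs)
print(sorted(specific_hits))
```

Proteins that bind the affinity matrix without probe treatment drop out, and a protein labeled under any one condition is retained in the pooled result.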
Serine Hydrolases Fuzzy Functional Forms (FFFs)
A set of serine hydrolase FFFs, structural motifs for identification of functional sites, was used to identify proteins in the S. cerevisiae proteome. As described in previous work (40, 41), physicochemical and structural data from Protein Data Bank (PDB) entries are combined with activity information from the biochemical literature to identify the key functional residues. Each FFF is defined by the following criteria: one or a small number of residue identities for each key residue, a set of geometric descriptors describing the relative orientation of the key residues, and the allowed variability (a standard deviation) for each geometric descriptor. As previously described (15, 41), a standard cross-validation training procedure creates each FFF to uniquely recognize the true positive structures. In this work, the resulting serine hydrolase FFFs were sensitive enough to discriminate between known serine hydrolase functional sites and all other proteins in a test database of 12,009 PDB structure files (PDB, release 092).
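The three ingredients of an FFF listed above (allowed residue identities, geometric descriptors, and a standard deviation for each descriptor) can be pictured as a small data structure. The sketch below is a hedged illustration only: it assumes that descriptors are inter-residue distances and that a match requires each distance to lie within a fixed number of standard deviations of its mean. The class names, the tolerance rule, and all numeric values are hypothetical and are not taken from the published FFFs.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class DistanceDescriptor:
    i: int        # index of the first key residue
    j: int        # index of the second key residue
    mean: float   # expected distance (angstroms), hypothetical
    sd: float     # allowed variability (one standard deviation)


@dataclass(frozen=True)
class FFF:
    allowed: tuple        # one set of permitted residue codes per key residue
    descriptors: tuple    # DistanceDescriptor entries
    n_sd: float = 2.0     # assumed tolerance, in standard deviations


def matches(fff, residues, coords):
    """True if residue identities and all distances satisfy the FFF."""
    if len(residues) != len(fff.allowed):
        return False
    if any(res not in ok for res, ok in zip(residues, fff.allowed)):
        return False
    return all(
        abs(dist(coords[d.i], coords[d.j]) - d.mean) <= fff.n_sd * d.sd
        for d in fff.descriptors
    )


# Toy Ser-His-Asp triad with made-up geometry.
triad = FFF(
    allowed=(frozenset("S"), frozenset("H"), frozenset("DE")),
    descriptors=(
        DistanceDescriptor(0, 1, 3.0, 0.5),
        DistanceDescriptor(1, 2, 4.0, 0.5),
        DistanceDescriptor(0, 2, 5.0, 0.5),
    ),
)
site = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
print(matches(triad, "SHD", site))
```

Allowing more than one residue code at a position (here Asp or Glu for the third residue) mirrors the "one or a small number of residue identities" criterion.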
The set of serine hydrolase template structures and FFFs were selected based on the following criteria (20): 1) the FFF describes a function requiring a nucleophilic serine; and 2) the FFF describes protease, lipase, esterase, amidase, or transacylase enzymatic activity. In addition, the flavin adenine dinucleotide-independent (S) hydroxynitrile lyase FFF was selected. While these lyases are not currently identified as members of the serine hydrolase family, the proteins have a nucleophilic serine, a characteristic Ser-His-Asp catalytic triad, and an α/ß hydrolase fold (42, 43). Also included is a transacylase (malonyl-CoA acyl carrier protein transacylase) that carries out a transferase enzymatic function, but is identified as a serine hydrolase (20).
Structure and Function Assignment Using the FFFs
A total of 6946 open reading frames (ORFs) from the yeast genome were threaded against a nonredundant dataset of known structures using the Prospector threading algorithm (44). Thirty-five serine hydrolase FFFs were then applied to the top five most significant threading alignments for each of four different scoring functions to identify the function(s) and active site(s) of the protein encoded by each ORF, as previously described (40). The genome sequences that aligned correctly to serine hydrolase structures in the structure library, according to the automated FFF match procedure, were identified as serine hydrolases and were further analyzed as described in the "Results" section.
To determine confidence in the overall threading alignments, a standard Z-score was calculated. To determine confidence in FFF function assignments, active site profiling was used (45). Briefly, experimental structures that display the particular functional activity described by an FFF (true positive structures) are aligned in three-dimensional space. Then, superimposed sequence fragments surrounding the FFF motif in space (illustrated in Fig. 1) are extracted from each structure and their sequences are aligned using CLUSTALW (46, 47). This alignment of the fragments from the active site vicinity in known structures is termed an active site profile for a given function or FFF. For each predicted functional site, the local fragments around the FFF-identified active site residues are extracted, aligned with the active site profile from the structures known to exhibit the function, and scored against these active site profiles. Each residue position in the functional site profile is scored by identity, conservation, and the presence of a gap. For a gap-free alignment, the score varies from 0 to 1. When gaps are introduced into predicted functional site profiles, the score can fall below zero. High confidence function annotations have functional site profile scores greater than 0.25 (45).
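The scoring behavior described above (values between 0 and 1 for gap-free alignments, values that can fall below zero once gaps appear, and a 0.25 threshold) can be reproduced with a toy scoring function. The sketch below is not the published active site profiling score (45). It assumes a per-position score equal to the frequency of the query residue in the profile column and a fixed gap penalty, both chosen only to illustrate the stated ranges. The fragments are illustrative strings.

```python
from collections import Counter

GAP = "-"
HIGH_CONFIDENCE_CUTOFF = 0.25  # threshold quoted in the text


def profile_score(query, profile_fragments, gap_penalty=1.0):
    """Toy score: mean column frequency of each query residue; gaps subtract."""
    length = len(query)
    if any(len(frag) != length for frag in profile_fragments):
        raise ValueError("query and profile fragments must be pre-aligned")
    total = 0.0
    for pos, residue in enumerate(query):
        if residue == GAP:
            total -= gap_penalty  # assumed penalty, not from the source
            continue
        column = Counter(frag[pos] for frag in profile_fragments)
        total += column[residue] / len(profile_fragments)
    return total / length


def is_high_confidence(score):
    return score > HIGH_CONFIDENCE_CUTOFF


# Illustrative pre-aligned fragments around a nucleophilic serine.
profile = ["GDSGG", "GDSGG", "GESGG", "GDSGA"]
for candidate in ("GDSGG", "G--GG", "-----"):
    s = profile_score(candidate, profile)
    print(candidate, round(s, 2), is_high_confidence(s))
```

With this toy rule a perfect, gap-free match approaches 1, while a fragment dominated by gaps scores below zero and is rejected.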
Function Identification Using Motif Databases
By definition an FFF serves as a template of the underlying chemical functionality of a protein, so equivalencies can be defined between FFFs and public tool motifs that describe the same or a related function; thus, motif equivalencies were established between FFFs and Pfam, BLOCKS and PRINTS motifs. The threading/FFF results were compared with the results obtained using three sequence motif databases: PRINTS 20.0 (48, 49); Pfam version 6.0 (50, 51); and BLOCKS (52, 53). These databases receive a sequence as input and output a list of sequence motifs ranked by score that may match the function of the query sequence. The top 10 hits by PRINTS and all query sequences above cutoff scores of 10 for Pfam and 5 for BLOCKS were analyzed to determine if the motifs identified a function equivalent to the FFF-assigned function. In addition, BLAST (54, 55) was used to assign function based on annotation transfer. Function assignment is inferred from sequences with similarity to the query sequence. For this study, a cutoff value of 0.01 was used, to ensure that this analysis identified distantly related sequences.
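The selection rules for the motif databases (top 10 hits for PRINTS, scores above 10 for Pfam, scores above 5 for BLOCKS) can be written as a small filter. This is a sketch under stated assumptions: hits are represented as hypothetical `(motif_name, score)` pairs, and the FFF-to-motif equivalence table is a made-up set. None of these names correspond to the actual output formats of the tools.

```python
PRINTS_TOP_N = 10
MIN_SCORE = {"Pfam": 10, "BLOCKS": 5}  # cutoffs quoted in the text


def select_hits(tool, hits):
    """Apply the per-tool selection rule to (motif_name, score) pairs."""
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    if tool == "PRINTS":
        return ranked[:PRINTS_TOP_N]
    if tool in MIN_SCORE:
        return [hit for hit in ranked if hit[1] > MIN_SCORE[tool]]
    raise ValueError(f"no cutoff defined for {tool!r}")


def agrees_with_fff(tool, hits, equivalent_motifs):
    """True if any selected motif is equivalent to the FFF-assigned function."""
    return any(name in equivalent_motifs for name, _ in select_hits(tool, hits))


# Hypothetical hits and equivalence table, for illustration only.
pfam_hits = [("motif_A", 42.0), ("motif_B", 7.5)]
print(agrees_with_fff("Pfam", pfam_hits, {"motif_A"}))
print(agrees_with_fff("Pfam", pfam_hits, {"motif_B"}))
```

The second call reports no agreement because the only equivalent motif scores below the Pfam cutoff.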
RESULTS |
---|
For this study, FFFs were created to describe and identify the diversity of serine hydrolase functional sites (such as the examples in Fig. 1). A common feature of these FFFs is the inclusion of a nucleophilic or active serine. The library utilized in this study contained 35 different serine hydrolase FFFs, including six composite FFFs (Table I). Composite FFFs are defined when more than one FFF describes a functional family or subfamily, a feature that allows identification of multiple biochemical activities within a functional site. For example, XE3.4.12 is a composite FFF composed of two individual FFFs: the serine protease/serine carboxypeptidase family catalytic site (E3.4.33) and the serine carboxypeptidase family pH regulatory site (E3.4.39) (Table I). In cross-validation studies, the FFFs in the serine hydrolase library were shown to uniquely identify serine hydrolase functional sites in experimentally determined structures (see "Experimental Procedures").
The set of 35 FFFs describe a diverse group of 25 different serine hydrolase functions (Table I). A discrepancy in the number of FFFs, 35, versus the number of different serine hydrolase functions, 25, exists because an enzyme function, as defined by the EC system, can be described by more than one FFF. For example, lipase (serine hydrolase defined by EC 3.1.1.3) catalytic sites can differ between bacterial and fungal organisms in structure and/or sequence, as illustrated by FFFs E3.1.58.1 (a bacterial lipase catalytic site) and E3.1.17 (fungal triacylglycerol lipase catalytic site). In these instances, two FFFs are necessary to describe the structural motifs that carry out this single EC-defined function. A common fold associated with serine hydrolases is the α/ß fold family (58). Some serine hydrolases assume the α/ß structure; however, many do not. Further, not all α/ß hydrolases are serine hydrolases. FFFs can distinguish α/ß hydrolases that exhibit a serine hydrolase function from those proteins that fold similarly to an α/ß hydrolase but exhibit another function altogether (56). Sixteen FFFs in the library describe serine hydrolase function in an α/ß hydrolase fold, including a single "family" FFF that is designed to recognize all the serine hydrolase proteins with α/ß hydrolase fold (E1.11.2) (Table I). There are 19 FFFs that describe serine hydrolases with a fold other than the α/ß hydrolase fold, including four composite FFFs (Table I).
To predict how successful the computational functional identification or annotation might be, we wanted to estimate the FFF coverage of the total currently defined structural space and the total serine hydrolase biological functional space. "Known serine hydrolase structural space" is based only on structures available in the Research Collaboratory for Structural Bioinformatics (RCSB) PDB (version 092). Approximately 63% of the total serine hydrolase structural space available at the initiation of the study was covered by this set of FFFs (Fig. 2). The FFFs used in this study describe several serine hydrolase subclasses, including serine proteases, serine lipases, serine esterases, serine transacylases, and (S) hydroxynitrile lyases. The FFF coverage of each of these subclasses, based on known structural space, ranges from 55 to 100%, with the exception of the serine amidases (Fig. 2). The serine amidase subclass FFF is conspicuously missing because of the limited structural coverage of serine hydrolase functional space at the time of this study.
Activity-based Labeling of Yeast Serine Hydrolases
ABPs have been developed that are able to interact specifically with active serine hydrolases in complex protein mixtures, including whole cells (for reviews, see Refs. 59 and 60). One of the most powerful aspects of the ABP technology is that it efficiently fractionates the proteome based on chemical reactivity, not on protein abundance. Because of the ABPs' ability to label the functional subset of the proteome, simple separation methods, such as one-dimensional gel electrophoresis, are able to resolve the bulk of the proteins of interest (Fig. 3, for example). In general, ABPs contain three subunits: a) a reactive moiety specific for an amino acid in the active site of enzymes of a particular class, b) a linker, and c) a tag that enables visualization or purification of probe-modified proteins. For the experiments performed here to identify the active site serine hydrolases from the yeast proteome, the reactive moiety was a fluorophosphonate, related to the broad specificity serine hydrolase inhibitor diisopropyl fluorophosphonate, and the tag was either biotin or tetramethylrhodamine (19, 20). An example of the activity profile obtained from the different centrifugation fractions is shown in Fig. 3. The pellet from the low-speed spin contains mostly unbroken cells and large fragments. The proteins identified in the supernatant from the low-speed spin were identical to proteins found in the fractions obtained from the high-speed spin. Thus, only the fractions from the high-speed spin were used in subsequent experiments. Comparison of the proteins identified using the avidin affinity column and the antibody affinity column shows that the avidin affinity column binds more proteins. Abundant or sticky proteins, i.e. those that bind readily to either avidin or antibody columns without the ABP treatment, are also readily identified by other methods (8) (Fig. 3). Outputs from the two affinity chromatography methods were combined to generate the results described here.
To demonstrate that these ABPs label yeast proteins in an activity-dependent manner, whole-cell yeast extracts were labeled with a serine hydrolase ABP. In the first experiment (Fig. 4A), a protein extract was labeled either with or without prior pretreatment with phenylmethylsulfonyl fluoride (PMSF). PMSF, like diisopropyl fluorophosphonate on which the ABP was based, is a broad-spectrum serine hydrolase inhibitor. As such, it was not surprising that several proteins labeled by an ABP were not labeled after the extract had been treated with PMSF (Fig. 4), demonstrating that ABPs do not recognize proteins that are inactive. Interestingly, several proteins were labeled by ABP after treatment with PMSF (Fig. 4A), indicating that not all yeast proteins with nucleophilic serines are completely inhibited by 1 mM PMSF, an observation that is not without precedent (61).
Mass Spectrometric Identification of ABP-labeled Serine Hydrolases in the Yeast Proteome
To identify as many serine hydrolases in the yeast proteome as possible, yeast were grown under four different conditions (ideal, aerobic oxidation, anaerobic fermentation, and sporulation), and the results were combined. Fractions were collected after centrifuging at 100,000 x g and labeled with the ABPs. Proteins were extracted, subjected to trypsin proteolysis, and analyzed by LC-MS/MS. Comparison of the proteins bound to the affinity matrix with and without ABP labeling showed 80 proteins uniquely labeled by an ABP. Further analysis generated two populations of these proteins. One population of 23 proteins (Table II) produced high-quality mass spectrometry data, wherein multiple peptides per protein were identified and/or the same protein was identified in multiple experiments. Proteins belonging to the other population, though not identified in the control experiments, gave weaker mass spectrometry results (only one peptide from the protein or identified in only one experiment) and may not have been modified with the ABP. The 57 proteins identified by this lower-quality data are listed in a footnote to Table II.
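The split into high-quality and lower-quality identifications can be expressed as a simple rule over peptide and experiment counts. The sketch below is an assumption-laden reading of the criteria: a protein is treated as high quality if it has multiple peptides or appears in multiple experiments, and as lower quality otherwise. The record type and the example entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identification:
    protein: str
    peptides: int      # distinct peptides matched to the protein
    experiments: int   # number of experiments in which it was identified


def is_high_quality(ident):
    """Multiple peptides and/or identification in multiple experiments."""
    return ident.peptides > 1 or ident.experiments > 1


def partition(idents):
    high = [i.protein for i in idents if is_high_quality(i)]
    low = [i.protein for i in idents if not is_high_quality(i)]
    return high, low


# Hypothetical records, for illustration only.
records = [
    Identification("ProteinA", peptides=4, experiments=3),
    Identification("ProteinB", peptides=1, experiments=2),
    Identification("ProteinC", peptides=1, experiments=1),
]
high, low = partition(records)
print(high, low)

# Counts reported in the text: the two populations account for all 80 proteins.
assert 23 + 57 == 80
```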
Of the 23 proteins identified by ABP labeling with high-quality mass spectrometry data, eight were previously annotated as hydrolases (Dap2, Kex1, Ppe1, Prb1, Prc1, Ste13, Yjl068c, and Amd2; Table II). One additional protein, Fas2, was previously annotated as 3-oxo-acyl carrier protein reductase/synthase and encodes the α subunit of yeast fatty acid synthase. (This function, fatty acid synthase, was also identified computationally by the Prospector threading algorithm, see results below.) The function annotation in SGD (at the time of this study) for the other 14 ABP-identified proteins is "function unknown" (Table II), and these experimental results alone now suggest the presence of a nucleophilic serine at the functional site in these 14 proteins.
Computational Identification of Serine Hydrolases in the Yeast Proteome
To analyze the yeast proteome computationally, the set of serine hydrolase FFFs was applied to the proteins encoded by the S. cerevisiae genome, as described in "Experimental Procedures." Briefly, threading alignments for each yeast amino acid sequence were generated using the Prospector threading algorithm (44), and confidence in each threading alignment was determined by a Z-score. FFFs were then applied to the top 20 threading alignments (the top five for each of four scoring functions) to identify the function(s) and active site(s) of each protein. The combination of a structure prediction and an FFF-based functional assignment for any sequence identified a putative serine hydrolase. Confidence in this function assignment was determined by calculating an active site profile score (45).
Overall, 19 individual hydrolase FFFs (Table I) identified 146 serine hydrolase protein sequences in the yeast genome (Table II and footnotes). Ten serine hydrolase FFFs did not hit any yeast sequences (Table I). Both component parts of one composite serine hydrolase FFF, XE3.4.12, hit two S. cerevisiae sequences, while the other five composites did not hit any sequences. Fifty-two of the 146 sequences were hit by more than one FFF. In these cases, both the protein family FFF and a more specific serine hydrolase FFF identified the functional site. For example, Ybr139w was identified by FFFs E1.11.2, E3.4.33, and E3.4.39 (serine hydrolase/α/ß family, serine protease/serine carboxypeptidase family catalytic site, and serine carboxypeptidase family pH regulatory site, respectively; Table II). These multiple hits add confidence in the function assignment because the FFF technology recognizes active site structural and chemical features of both the family and a subclass of proteins.
Z-scores and active site profile scores were calculated for each sequence annotated by a serine hydrolase FFF (Figs. 5 and 6). Z-scores are a quantitative measure of the confidence in a global sequence alignment between a yeast sequence and a serine hydrolase whose structure has been determined. The active site profile score, on the other hand, quantifies the similarity between the sequence and a structurally determined serine hydrolase only in the region of the functional site (45). Without an FFF annotation, a Z-score greater than or equal to 5.0 is considered significant for a threading alignment produced by the Prospector version used in this study. Active site profile scores of 0.25 or greater are considered significant, regardless of the Z-score. The Z-scores for the threading alignments hit by serine hydrolase FFFs range from less than 2 to greater than 20 (Fig. 5A), with no correlation between score and high-confidence ABP label. Of eight proteins previously annotated as serine hydrolases by SGD (and identified by ABP labeling), only four (Kex1, Prb1, Prc1, and Ste13) align to known serine hydrolase structures with a Z-score greater than 5. Three others (Dap2, Ppe1, and Yjl068c) exhibit insignificant Z-scores of 2.8, 3.5, and 1.9, respectively (Table II). Amd2 did not align to a known serine hydrolase structure using this threading algorithm.
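The two significance criteria quoted above can be applied independently, as in the following minimal sketch. The cutoffs and the three Z-scores come from the text. The function names are hypothetical, and the ORF spelling Yjl068c follows the usage elsewhere in this article.

```python
Z_CUTOFF = 5.0         # threading alignment significance, from the text
PROFILE_CUTOFF = 0.25  # active site profile significance, from the text


def threading_is_significant(z_score):
    return z_score >= Z_CUTOFF


def annotation_is_significant(profile_score):
    # Judged on the profile score alone, regardless of the Z-score.
    return profile_score >= PROFILE_CUTOFF


# Z-scores reported in the text for three known serine hydrolases.
reported_z = {"Dap2": 2.8, "Ppe1": 3.5, "Yjl068c": 1.9}
weak_threading = sorted(
    name for name, z in reported_z.items() if not threading_is_significant(z)
)
print(weak_threading)
```

All three proteins fall below the threading cutoff even though they are known serine hydrolases, which is why a local, functional-site score is evaluated separately from the global alignment score.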
Thirty-three of 52 FFF-identified proteins with significant profile scores (Table II and footnotes) were annotated as "function unknown" in SGD at the time of this study, so the computational results alone provide possible indication of function for these proteins. Some of the proteins with significant profile scores, such as Kex2 and Ysp3, are known hydrolases that were identified by the FFFs but not by ABP labeling; it is probable that these proteins were not expressed, or were expressed but inactive, under the four expression conditions studied. Other sequences with significant profile scores, including Yar009cp, Ycl019w, and Ydr034c-d, have an SGD annotation of "protease," so the current computational analysis provides some additional information to support that annotation. A small number had previously been annotated, albeit not as serine hydrolases. Two of these, Ynl277w and Yfr027w, were annotated as acetyl transferases. Although acetyl transferases were not specifically covered by the FFFs used in this study, the malonyl-CoA acyl carrier protein transacylase function was covered (E2.3.5; Table I), as this function is suggested to be a serine hydrolase (20). It is possible that the FFFs are correctly identifying an active site serine in these proteins. Four other proteins were annotated in SGD with other functions: Yjl045w, succinate dehydrogenase; Ynl055c, voltage-dependent ion-selective channel; Yor191w, DNA-dependent adenosine triphosphatase; and Ymr234w, Rnh1 or ribonuclease HI. The relationship between these annotations and the FFF-based annotations, if any, remains to be determined.
Comparison of Computational and Experimental Proteomics Methods: Serine Hydrolases Identified by ABP Labeling and FFFs
Under four expression conditions, ABP labeling identified 23 proteins with high-quality mass spectrometry data (Table II). If all of these are correct identifications, FFF analysis identified over 65% (15 of 23) as serine hydrolases (Table II). Based on estimates of FFF coverage of structural space (63%, Fig. 2) and functional space (23%), this is the expected result. Of the 15 proteins identified both by ABP labeling (high-quality mass spectrometry data) and by serine hydrolase FFFs, seven had been annotated at SGD prior to this work (Dap2, Kex1, Ppe1, Prb1, Prc1, Ste13, and Yjl068c; Table II). Eight sequences (over 50%) were annotated as "function unknown" or "hypothetical protein" at the time of this work (Eht1, Yju3, Ybr139w, Ybr204c, Yhr049w, Ylr118c, Ymr222c, and Yor280c; Table II). The combination of independently applied computational and experimental proteomics methods described in this paper allows confident assignment of serine hydrolase function to these proteins. The chemical proteomics technology indicates that a functional protein with an active serine is expressed in the cell. The computational proteomics technology adds details about the type of function, the structure of the functional site, and the specific residue that is likely labeled by the ABP. This combination of technologies adds significant knowledge about the family of serine hydrolases in the well-studied yeast organism.
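The overlap reported above amounts to a set intersection followed by two ratios. The sketch below uses only protein names that appear in this article. It is illustrative rather than a reproduction of Table II: one of the 23 high-quality ABP identifications is not named in the text, so the total of 23 is entered as a constant, and only two of the FFF-only hits (Kex2 and Ysp3) are listed.

```python
previously_annotated = {"Dap2", "Kex1", "Ppe1", "Prb1", "Prc1", "Ste13", "Yjl068c"}
previously_unknown = {"Eht1", "Yju3", "Ybr139w", "Ybr204c",
                      "Yhr049w", "Ylr118c", "Ymr222c", "Yor280c"}

# High-quality ABP identifications named in the text (22 of the 23).
abp_hits = previously_annotated | previously_unknown | {
    "Amd2", "Ygl039w", "Ygl157w", "Yml059c", "Yor084w", "Fas2", "Ynl123w",
}
# Serine hydrolase FFF hits: the shared proteins plus two FFF-only examples.
fff_hits = previously_annotated | previously_unknown | {"Kex2", "Ysp3"}

ABP_HIGH_QUALITY_TOTAL = 23  # count reported in the text

high_confidence = abp_hits & fff_hits
newly_annotated = high_confidence - previously_annotated

print(len(high_confidence))
print(round(100 * len(high_confidence) / ABP_HIGH_QUALITY_TOTAL, 1))
print(round(100 * len(newly_annotated) / len(high_confidence), 1))
```

The two percentages correspond to the "over 65%" and "over 50%" figures quoted in the paragraph.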
Of the eight ABP-labeled proteins that were not annotated by serine hydrolase FFFs, one, Amd2, is annotated as a putative amidase in SGD. The lack of identification by an FFF is not surprising because no amidase FFF was available at the time of this study. Three of these eight ABP-labeled proteins (Ygl039w, Ygl157w, and Yml059c) were annotated by another common set of FFFs (Table II). These include FFFs covering the functions UDP-galactose-4-epimerase, estradiol-17-ß dehydrogenase, and 3-α, 20-ß-hydroxysteroid dehydrogenase. The FFFs for these three functions have a common active site tyrosine and serine, but these functions were not included in the serine hydrolase FFF library. The mass spectrometric method used here does not report the amino acid labeled by the ABP. Thus, computational FFF analysis serves to clarify the function of these ABP-identified proteins and suggests the specific residues that may be labeled.
Five proteins identified by ABP labeling and high-quality mass spectrometry data were not annotated by any FFFs (Table II). Three of these, however, did thread to proteins whose structures had previously been determined: Yor084w threaded to 1a8uA, Fas2 threaded to 1kas, and Ynl123w threaded to 1pysB, all with significant Z-scores. 1a8uA is the structure of cofactor-free chloroperoxidase T, a known serine hydrolase. In the threading alignment, the active site serine aligns with a serine in Yor084w, but neither the active site aspartic acid nor the histidine of 1a8uA is aligned with a similar residue in Yor084w (data not shown); thus the FFF was unable to recognize this alignment. This protein is identified by the ABP, Prospector recognizes an overall similarity to chloroperoxidase T, and a potential active serine is recognized, so this protein may be a serine hydrolase. However, the alignment does not include a properly aligned active site that could be recognized by the complete FFF, so further experimentation would be required to understand the function of Yor084w. 1kas, to which Fas2 aligned, is the structure for ß keto acyl ACP synthase. Fas2 is a known 3-oxo-ACP (acyl carrier protein) reductase/synthase (annotation provided by SGD), so Prospector easily recognized this homolog. No FFF has been constructed to recognize active sites in this protein. This protein is known to have active serines, one of which binds the pantetheine prosthetic group (62, 63), and the active serines may be the ABP binding site in this protein. Again, both methods individually identified these proteins, but the methods are synergistic and together provide additional information to aid in the interpretation of the results.
A New Family of Eukaryotic Serine Hydrolases Identified by a Combination of Chemical and Computational Proteomics Methods
Of the 15 proteins identified both by the ABP labeling and FFF analysis, eight were previously of unknown function (Table II). Three of these, Yhr049w, Ymr222c, and Yor280c, are related to each other by sequence similarity (Fig. 7) and appear to constitute a novel family of serine hydrolases found only in eukaryotes. We propose to call these proteins Fsh1–3 (Yhr049w/Fsh1, Ymr222c/Fsh2, and Yor280c/Fsh3). To compare how other computational proteomics methods annotate these proteins, Pfam, BLOCKS, and PRINTS sequence motif methods were applied to Fsh1–3. None of these methods was able to assign molecular function with high confidence to any of the three yeast proteins. PRINTS identifies Yhr049w/Fsh1 as a prolyl aminopeptidase, but at an insignificant E-value (e = 590). Likewise, Pfam annotates Fsh1 as a phospholipase/carboxylesterase with an insignificant E-value of 4.2. Pfam identifies Ymr222c/Fsh2 as phospholipase/carboxylesterase with an E-value of 0.12. None of the sequence motif tools annotated Yor280c/Fsh3 as any type of serine hydrolase.
BLAST database searches using the Fsh proteins identified a sequence from S. pombe, DYR_SCHPO, for which dihydrofolate reductase (DHFR) function has been shown by sequence comparison and has been experimentally confirmed (64). Based on a database search, the similarity between DYR_SCHPO and Yor280c/Fsh3p is judged to be significant, with an E-value of 2 × 10⁻²⁵. Initially, this result was confounding because DHFR does not possess a nucleophilic serine that would account for labeling by a serine hydrolase ABP. Further sequence comparison, however, revealed that the well-characterized DHFR from S. cerevisiae aligns to the C-terminal portion of DYR_SCHPO, indicating DHFR function only in the C-terminal region of the protein. Moreover, 90% of the Yor280c/Fsh3p sequence aligns to the N-terminal 232 residues of the S. pombe DYR_SCHPO protein. DYR_SCHPO appears to be a multifunctional protein, possessing serine hydrolase function in the N-terminal domain and DHFR function in the C-terminal domain.
OVCA2, a sequence encoded in the human genome, is likely to be a serine hydrolase as it aligns to Yor280c/Fsh3 with a BLAST E-value of 5 × 10⁻¹⁰ (Fig. 7). This protein is independently identified as a serine hydrolase by the FFF technology, and recombinant OVCA2 can be labeled with a serine hydrolase ABP (data not shown). OVCA2 is a 227-aa human protein encoded by a ubiquitously expressed gene identified near a tumor suppressor locus (65). Deletion of this gene has been correlated recently with incidence of esophageal squamous cell carcinomas (66), and the protein expression is down-regulated in a lung cancer cell line treated with retinoid derivatives (67). Although sequence similarity to rat and worm genes and the S. pombe DHFR sequences has been noted (67), the biochemical or molecular function of this candidate tumor suppressor was not known previously. Results of this study demonstrate the serine hydrolase function of OVCA2, and the alignment shows that it does not contain the DHFR domain of DYR_SCHPO.
DISCUSSION |
---|
In this study, a unique combination of computational, structural, and chemical proteomics methods was independently applied to identify active serine hydrolases from S. cerevisiae. The computational method utilizes structural information to identify functional sites in sequences. This method has the advantages of identifying specific functional sites and not relying on global sequence or structure alignment, but suffers from false positives resulting from inaccurate threading alignments and spurious alignment of putative functional residues. Computational methods also suffer limitations due to the use of scoring cutoffs, causing the loss of true positive results. The chemical proteomics method has the advantages of experimentally identifying functional sites in whole cells and distinguishing between functional and nonfunctional proteins. This ABP-based method, however, provides results that are specific to the conditions of the experiment. Additionally, the ABPs used in this study react with nucleophilic hydroxyl groups, whether they are a part of serines in serine hydrolases or reactive serines or tyrosines in other enzymes.
Independent application of these methods and comparison of the results provides unique insight into these advantages and disadvantages. Both methods generate a significant number of results that could not be confirmed by the other method (footnotes, Table II). These other identifications are not necessarily incorrect. For instance, several of the FFF-identified proteins with significant profile scores are annotated as peptidases or peptide hydrolases in SGD, including Kex2 and Ysp3. A protein may be identified by the computational method and not by the chemical proteomics method because the correct condition for expression of active protein was not probed or tested. Alternatively, a protein may be identified by the chemical proteomics method and not the computational method because the protein is a novel serine hydrolase whose structure or functional site has not previously been described. Thus, the proteins identified by only one method await confirmation by other methods.
Fifteen proteins, however, were identified by both computational and chemical proteomics methods, and these are designated as high-confidence identifications. Seven of the 15 proteins were previously identified as serine hydrolases by other methods and are confirmed by the current analysis. Eight of the proteins were previously unannotated in SGD; thus, the combined methods add a significant amount of knowledge regarding the function of these proteins. Because of the combined approach, confidence in these designations is high. Within these eight previously unrecognized proteins, we have discovered a novel family of eukaryotic serine hydrolases, which we designate as Fsh. Three related members of this family, Fsh1, Fsh2, and Fsh3, were identified in S. cerevisiae. Surprisingly, the Fsh family member protein found in S. pombe is fused to DHFR. The fusion of a serine hydrolase domain to DHFR indicates a possible novel pathway in folate metabolism that requires coordinated function of a serine hydrolase with DHFR in S. pombe and perhaps other organisms, even though the domains are not covalently fused in these other organisms.
The results are remarkable for the overlap, given the low coverage of serine hydrolase space by FFFs for computational proteomics and the inability to exhaustively test expression conditions for chemical proteomics. Given these practical limitations, the results demonstrate that both methods worked well and the synergies obtained from independent application of the two methods are significant.
Comparison of Combined Proteomics Analysis with Results from Other Proteomics Methods
Comparison of our combined proteomics methods with other experimental proteomics methods is difficult because experimental conditions and test sets are not the same. In addition, most technologies do not identify functions specifically. Yates and colleagues developed the MUDPIT technology linking 2D LC-MS/MS and performed a large-scale analysis of the yeast proteome (8, 9). This technology identified 1484 proteins expressed in yeast under one expression condition. Gygi and colleagues utilized multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS), with which they identified 7537 unique peptides and 1504 proteins under one expression condition (10). While these methods represent significant advances in analysis of complex proteomes, neither addresses the question of protein functionality. We can, however, determine how many of the serine hydrolases identified by the combined ABP and FFF technologies were also identified by these 2D LC-MS/MS technologies (Fig. 8). Of the fifteen sequences identified by the ABP/FFF technology, four were identified by Yates and colleagues (Kex1, Prc1, Eht1, and Yhr049w) and eight were identified by Gygi and colleagues (Prb1, Prc1, Yjl068c, Eht1, Ybr139w, Yhr049w, Ylr118c, and Yor280c).
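The overlap described above can be tallied with simple set operations. The following is a minimal sketch in Python; the protein names and the total of fifteen are taken from the paragraph above, while the variable names are illustrative and not part of the original analysis.

```python
# Proteins among the 15 high-confidence ABP/FFF identifications that were
# also reported by each 2D LC-MS/MS study (names as listed in the text).
yates = {"Kex1", "Prc1", "Eht1", "Yhr049w"}
gygi = {
    "Prb1", "Prc1", "Yjl068c", "Eht1",
    "Ybr139w", "Yhr049w", "Ylr118c", "Yor280c",
}

HIGH_CONFIDENCE_TOTAL = 15  # identified by both ABP labeling and FFF analysis

found_by_both = yates & gygi      # seen in both LC-MS/MS data sets
found_by_either = yates | gygi    # seen in at least one data set
missed_by_both = HIGH_CONFIDENCE_TOTAL - len(found_by_either)

print("In both data sets:", sorted(found_by_both))
print("In at least one data set:", len(found_by_either))
print("In neither data set:", missed_by_both)
```

Under these lists, three proteins appear in both data sets, nine appear in at least one, and six of the fifteen high-confidence serine hydrolases appear in neither.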
We also compared the ability of the computational methods, FFFs and Pfam, to correctly annotate the proteins identified by ABP labeling. Of the 23 proteins identified by ABP labeling, we have already shown that FFFs identified 15. Pfam identified 10 of the 23 as serine hydrolases. Thus, the number of high-confidence identifications would be lower if Pfam were used as the computational method. As described above, 14 of the 23 ABP-identified proteins were previously annotated as molecular function unknown. Of these 14 novel identifications by ABP labeling, FFFs identified eight and Pfam identified four as serine hydrolases. These results emphasize the synergies between the FFF and the ABP labeling technologies.
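To make the comparison concrete, the counts above can be expressed as the fraction of ABP-identified proteins that each computational method also identified. This is a small sketch using only the counts stated in the text; the function and variable names are assumptions introduced for illustration.

```python
ABP_TOTAL = 23   # proteins identified by ABP labeling
ABP_NOVEL = 14   # of those, previously annotated as molecular function unknown

# Number of ABP-identified proteins also called serine hydrolases by each method.
counts = {
    "FFF": {"all": 15, "novel": 8},
    "Pfam": {"all": 10, "novel": 4},
}


def percent_confirmed(hits: int, total: int) -> float:
    """Percentage of ABP-identified proteins confirmed by a computational method."""
    return 100.0 * hits / total


for method, c in counts.items():
    all_pct = percent_confirmed(c["all"], ABP_TOTAL)
    novel_pct = percent_confirmed(c["novel"], ABP_NOVEL)
    print(f"{method}: {all_pct:.0f}% of all ABP hits, {novel_pct:.0f}% of novel ABP hits")
```

On these counts, FFFs confirm roughly 65% of all ABP hits and 57% of the novel ones, compared with roughly 43% and 29% for Pfam.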
The results obtained by FFF and ABP labeling were compared with results obtained by other computational methods, including the local sequence signature databases and sequence-based function annotation tools (BLOCKS (52, 53), PRINTS (48, 49), and Pfam (50, 51)). A BLAST (54, 55) analysis was also used to assign function by sequence similarity to other annotated proteins in the NCBI GenBank nonredundant sequence database. Of the 146 FFF assignments, 87 were identified only by the FFF technology and not by any other computational tool. Thirteen of these novel hits (Yor280c, Ycl019w, Yol007c, Yhr134wp, Rnhlp, Yjl045w, Ynl182c, Ylr103c, Yor191w, Ybl089w, Ylr345w, Ypr147cp, and Yfr027w) have functional site profile scores greater than 0.25 (Fig. 6A, gray bars; Table II footnotes). One, Yor280c, was identified by ABP labeling. None of these novel hits have significant Z-scores (Fig. 5A, gray bars). This result emphasizes the similarity between what can be identified by public tools and by threading algorithms and also emphasizes the difference between these global alignment methods and what is identified by FFF analysis and active site profiling; however, further experimentation is required to understand its implications.
Analysis of High-confidence Identifications Provides Insight into the Limitations of Computational Scoring Methods
The Z-score distributions for the threading alignments on the 23 ABP-identified proteins are shown in Fig. 5B. There is no correlation between Z-score and confidence in ABP-labeled proteins; the scores range from 1 to >20. About half of the high-confidence proteins have Z-scores greater than 5, and about half have Z-scores less than 5 (Fig. 5B). Z-scores less than 5 indicate statistically insignificant alignments; using only threading and this scoring statistic, there would be no confidence in the function identification for these proteins. Because Z-scores do not correlate with the high-confidence ABP hits, threading and similar methods that rely on the global alignment of sequences or structures are inadequate for assigning function between distantly related sequences.
On the other hand, methods that focus on functional sites themselves, such as FFF and active site profile analysis, correlate much better with the ABP-labeled proteins. Eighty percent of the high-confidence ABP-identified proteins have significant (greater than 0.25) active site profile scores (Fig. 6B). Use of a scoring function that focuses on the active site improves the correlation between computation and experiment and provides a better computational function annotation.
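The two cutoffs discussed here (a threading Z-score of 5 and an active site profile score of 0.25) can be contrasted in a short sketch. The cutoff values come from the text; the ORF names and score values below are hypothetical placeholders chosen only to illustrate the logic, not data from this study, and treating the cutoffs as strict inequalities is an assumption.

```python
from dataclasses import dataclass

Z_SCORE_CUTOFF = 5.0         # threading significance cutoff stated in the text
PROFILE_SCORE_CUTOFF = 0.25  # active site profile cutoff stated in the text


@dataclass
class Prediction:
    orf: str
    z_score: float
    profile_score: float

    @property
    def passes_threading(self) -> bool:
        # Assumption: strictly greater than the cutoff counts as significant.
        return self.z_score > Z_SCORE_CUTOFF

    @property
    def passes_profile(self) -> bool:
        return self.profile_score > PROFILE_SCORE_CUTOFF


# Hypothetical examples (NOT values from the paper).
examples = [
    Prediction("hypothetical_orf_A", z_score=12.0, profile_score=0.40),
    Prediction("hypothetical_orf_B", z_score=2.1, profile_score=0.31),
    Prediction("hypothetical_orf_C", z_score=1.5, profile_score=0.10),
]

# Predictions that a Z-score cutoff alone would discard but that the
# functional-site score retains.
rescued = [p.orf for p in examples if p.passes_profile and not p.passes_threading]
print(rescued)
```

In this toy case only `hypothetical_orf_B` is rescued, which mirrors the situation described above: an alignment that is insignificant globally can still carry a well-conserved functional site.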
A Cautionary Tale Involving Annotation Transfer Based on Sequence Alignment
Function assignment is often based on annotation transfer when experimental evidence is unavailable; furthermore, annotation transfer based on sequence similarity is often applied in a high-throughput fashion, without manual curation. Two findings reported here highlight the risks of this approach, which have been pointed out by several other researchers (68, 69).
Using annotation transfer, the Fsh proteins might be assigned DHFR function, or at least be placed in the same protein family as DHFR, because of sequence similarity between the Fsh proteins and the N-terminal domain of DYR_SCHPO, a known DHFR. The data presented here suggest that DYR_SCHPO is a multifunctional protein, both an Fsh and a DHFR, and that the S. cerevisiae Fsh proteins presumably contain only one domain exhibiting serine hydrolase function. At the Yeast Proteome Database (www.proteome.com), Ymr222c/Fsh2 is predicted to have oxidoreductase function, presumably due to annotation transfer. Thus, the assignment of oxidoreductase function to Ymr222c/Fsh2 may be a faulty hypothesis and provides a cautionary example for post-genomic analyses.
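One way to guard against this kind of error is to transfer an annotation only when the aligned region actually covers the domain that carries the function. The sketch below illustrates that idea; it is not the procedure used in this study. The N-terminal boundary of 232 residues comes from the text, whereas the C-terminal domain coordinates, the alignment coordinates, and the coverage threshold are placeholders assumed for illustration.

```python
from typing import List, NamedTuple


class Domain(NamedTuple):
    function: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of residues shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def transferable_functions(
    domains: List[Domain], hit_start: int, hit_end: int, min_coverage: float = 0.5
) -> List[str]:
    """Return functions whose domain is covered by the aligned region.

    min_coverage is an assumed threshold: the fraction of the domain that
    must fall inside the alignment before its function is transferred.
    """
    functions = []
    for d in domains:
        length = d.end - d.start + 1
        if overlap(d.start, d.end, hit_start, hit_end) / length >= min_coverage:
            functions.append(d.function)
    return functions


# Two-domain subject loosely modeled on DYR_SCHPO. Residues 1-232 follow the
# text; the DHFR range 233-450 is a placeholder, not a value from the paper.
subject = [Domain("serine hydrolase", 1, 232), Domain("DHFR", 233, 450)]

# Hypothetical alignment of an Fsh-like query to subject residues 5-230.
print(transferable_functions(subject, hit_start=5, hit_end=230))
```

With these placeholder coordinates, only the serine hydrolase annotation is transferred, because the alignment never reaches the C-terminal DHFR domain.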
A second such example can be found in another result from this study in which the candidate human tumor suppressor OVCA2 was found to have serine hydrolase function. Although the physiologic role of OVCA2 has not been determined, a recent study by Prowse et al. suggests a role in retinoid-induced growth arrest, differentiation, and apoptosis, and identifies homology with DHFRs (67). We show here that OVCA2 likely contains a serine hydrolase domain, but not a DHFR domain.
CONCLUSION |
---|
ACKNOWLEDGMENTS |
---|
FOOTNOTES |
---|
Published, MCP Papers in Press, November 24, 2003, DOI 10.1074/mcp.M300082-MCP200
1 The abbreviations used are: 2D, two-dimensional; ABP, activity-based probe; BLAST, basic local alignment search tool; DHFR, dihydrofolate reductase; DTT, dithiothreitol; FFF, fuzzy functional form; Fsh, family of serine hydrolases; LC-MS/MS, liquid chromatography coupled with tandem mass spectrometry; ORF, open reading frame; PBS, phosphate-buffered saline; PDB, Protein Data Bank; RCSB, Research Collaboratory for Structural Bioinformatics; SGD, Saccharomyces cerevisiae Genome Database; YPD, media containing yeast extract, peptone, dextrose; EC, Enzyme Classification; PMSF, phenylmethylsulfonyl fluoride.
* The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Current address: National Center for Genome Resources, Santa Fe, NM 87505.
|| Current address: Departments of Physics and Computer Science, Wake Forest University, Winston-Salem, NC 27109.
** Current address: Science Applications International Corp., Biomedical Information Solutions Division, 10210 Campus Point Drive MS A2F, San Diego, CA 92121.
Current address: Wadsworth Center for Laboratory Research, 120 New Scotland Avenue, Room 5046, Albany, NY 12201.
Current address: The Scripps Research Institute, Department of Molecular Biology, 10550 North Torrey Pines Road, La Jolla, CA 92037.
¶¶ Current address: Wake Forest University, Departments of Physics and Computer Science, Winston-Salem, NC 27109.
|||| To whom correspondence should be addressed: Departments of Physics and Computer Science, 100 Olin Physical Laboratory, Wake Forest University, Winston-Salem, NC 27109-7507. Tel.: 336-758-4957; Fax: 336-758-6142; E-mail: fetrowjs{at}wfu.edu
REFERENCES |
---|