From the Department of Chemistry and Biochemistry and the Molecular Biology Institute, University of California, Los Angeles, California 90095-1569 and the
Department of Microbiology, Montana State University, Bozeman, Montana 59717-3520
ABSTRACT
Considering the broad range of target molecules, biochemists who study methylation are fortunate that many of the AdoMet-dependent methyltransferases share common three-dimensional signatures (notably in the AdoMet-binding regions) that are imperfectly reflected in similarities in their primary sequences (4). There are, at present, at least three structurally defined types of AdoMet-dependent methyltransferases. The major class (Class I) is based on a seven-strand twisted β-sheet structure (4, 5). A second, recently described class (Class II) is exemplified by the SET proteins (6). Finally, a third class (Class III) is the set of membrane-associated enzymes with multiple membrane-spanning regions (7).
Here we describe the unification of methods developed to mine the information available in primary sequences and to screen entire genomes, in an attempt to assign in silico all known and novel AdoMet-dependent methyltransferases of the major seven-strand twisted β-sheet family. The common motifs for Class I AdoMet-dependent methyltransferases were first recognized in 1989, when three regions of similarity were noticed between the protein L-isoaspartyl O-methyltransferase and certain nucleic acid and small molecule methyltransferases (8). Over the years, these regions were expanded, largely by manual inspection of sequences, into Motif I, Post I, Motif II, and Motif III (9). These motifs were first used in 1999 to scan the entire genome of Saccharomyces cerevisiae for putative methyltransferases (10). The result of the 1999 analysis was a list of 26 candidate S. cerevisiae open reading frames (ORFs).
The techniques used to perform the 1999 search relied heavily on the BLAST algorithm (11), a tool that performs sequence similarity searches. In this work, we describe three extensions of the search protocol for novel methyltransferases. First, we have redefined the motifs using a positionally sensitive scoring matrix in which, for example, the first letter in the motif might be considered more important for a match than the third letter. Second, we have defined these motifs using an assortment of known methyltransferases with different substrate specificities. Finally, we have automated these tasks to allow easy refinement as more methyltransferases are discovered and rapid screening of new genomes as they are sequenced. The results of the motif analyses were verified, and in some cases extended, using sequence profile analysis implemented in PSI-BLAST (12) and HMMer (13), arguably two of the best tools for detecting remote sequence homology.
EXPERIMENTAL PROCEDURES
In addition to the MSD that we have built, the Saccharomyces Genome Database (SGD; Footnote 2) provides two databases of gene annotation. We have regularly downloaded these from the SGD and loaded them into local MySQL tables with table definitions similar to those used by SGD. The two databases are available as ftp://genome-ftp.stanford.edu/pub/yeast/tables/ORF_Descriptions/orf_geneontology.tab and ftp://genome-ftp.stanford.edu/pub/yeast/gene_registry/registry.genenames.tab.
One of the advantages of using MySQL is the multitude of programmatic interfaces. The methods described below generate lists of putative methyltransferases, which can then be scored automatically by querying either our MSD or the SGD using locally built programs.
Non-weighted ("Canonical") Degenerate Pattern Searching
The most straightforward method of motif generation and searching is the process of aligning the amino acid sequences of known methyltransferases in the conserved motif regions and making a consensus sequence based on those regions. This is described as a degenerate pattern as each position can possibly be one of several amino acids and as non-weighted because no position is considered more or less important than another (Table II).
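A minimal sketch of a non-weighted degenerate pattern search follows. It assumes the pattern is supplied as one string of allowed residues per position, with "x" standing for any residue; the nine-position pattern shown is a hypothetical illustration and is not the consensus from Table II.

```python
import re


def degenerate_to_regex(positions):
    """Compile a list of allowed-residue strings (one per position) into a regex.

    "x" means any residue; every other string lists the residues accepted
    at that position. No position carries more weight than another.
    """
    parts = []
    for allowed in positions:
        if allowed == "x":
            parts.append(".")
        elif len(allowed) == 1:
            parts.append(re.escape(allowed))
        else:
            parts.append("[" + "".join(sorted(set(allowed))) + "]")
    return re.compile("".join(parts))


def find_matches(pattern, sequence):
    """Return (start, matched_text) for every match, including overlapping ones."""
    lookahead = re.compile("(?=(" + pattern.pattern + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]


# Hypothetical Motif I-like pattern, for illustration only.
EXAMPLE_PATTERN = ["VIL", "VIL", "DE", "VIL", "G", "CAS", "G", "x", "G"]
```

For example, `find_matches(degenerate_to_regex(EXAMPLE_PATTERN), "MKVLDVGCGPGAA")` reports a single match starting at index 2.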
The MEME training set was built as follows and is represented graphically in Fig. 2. Entrez (the National Center for Biotechnology Information database query tool, www.ncbi.nlm.nih.gov/) was queried for the keyword "methyltransferase." Of the 5845 matches, all entries not from the RefSeq database were removed; RefSeq is the National Center for Biotechnology Information's curated set of entries, designed to reflect the most accurate records. The remaining 1064 entries were pruned using BLAST such that the final set did not contain any two sequences that matched with an expect value less than the desired threshold using the BLOSUM62 scoring matrix; the purpose of this culling is to remove entries that are highly similar to one another, which would lead to overrepresentation of certain sequences. With a cutoff expect value of 10^-20, 289 sequences were in the final training set, of which 173 contributed to the definition of Motif I; the other 116 did not have regions similar enough to contribute to the Motif I definition and may represent Class II or III methyltransferases.
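The culling step can be sketched as a greedy filter. This is an illustration under stated assumptions, not the authors' program: it assumes the all-against-all BLAST expect values have already been collected into a dictionary keyed by unordered pairs of sequence identifiers (a hypothetical input format), and it keeps the first member encountered of any similar pair.

```python
def cull_redundant(ids, pairwise_evalue, threshold=1e-20):
    """Greedily keep sequences so that no two kept entries match below `threshold`.

    ids: sequence identifiers in the order they should be considered.
    pairwise_evalue: dict mapping frozenset({id_a, id_b}) -> BLAST expect value;
        pairs absent from the dict are treated as having no detectable match.
    """
    kept = []
    for seq_id in ids:
        redundant = any(
            pairwise_evalue.get(frozenset((seq_id, other)), float("inf")) < threshold
            for other in kept
        )
        if not redundant:
            kept.append(seq_id)
    return kept
```

Lowering the threshold (for example from 10^-20 to 10^-50) removes fewer sequences, so the resulting training set is larger.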
Automated Scoring of Candidates
Automated methyltransferase identification methods produce lists of gene names (open reading frames) that are putatively methyltransferases. The results of these searches need to be evaluated, including the rather large lists generated by the non-weighted degenerate searches. Evaluation is the process of taking a list of generated candidates and deciding for each candidate whether it is a known methyltransferase (a "hit"), whether it is known not to be a methyltransferase (a "miss" or "false positive"), or whether it is neither of the above (a "putative methyltransferase"). Systematic evaluation is performed as follows. If the candidate is in the MSD, assignment is based on its MSD score (2 or 3 is a hit, -2 or -3 is a miss, and -1, 0, or 1 is a putative methyltransferase). Otherwise, the two SGD annotation tables (orf_geneontology.tab and registry.genenames.tab) are queried. If the annotations are marked as "unknown," the candidate is considered a putative methyltransferase. If the annotations contain the word "methyltransferase," the candidate is considered a methyltransferase. Otherwise, the candidate is considered an incorrect prediction (a false positive).
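The evaluation rules above can be sketched as a small function. The data structures are assumptions made for illustration: the MSD is represented as a dictionary of integer scores and the two SGD tables as a dictionary of annotation strings, rather than the MySQL tables actually used. Treating an ORF with no annotation text as "unknown" is also an assumption.

```python
def classify_candidate(orf, msd_scores, sgd_annotations):
    """Label one candidate ORF as 'hit', 'miss', or 'putative'.

    msd_scores: dict mapping ORF name -> integer MSD score (-3 to 3).
    sgd_annotations: dict mapping ORF name -> list of annotation strings
        taken from the two SGD tables.
    """
    # The MSD takes precedence whenever the ORF has an entry there.
    if orf in msd_scores:
        score = msd_scores[orf]
        if score >= 2:
            return "hit"
        if score <= -2:
            return "miss"
        return "putative"

    text = " ".join(sgd_annotations.get(orf, [])).lower()
    # Assumption: an ORF with no annotation text is handled like "unknown".
    if not text.strip() or "unknown" in text:
        return "putative"
    if "methyltransferase" in text:
        return "hit"
    return "miss"


def score_candidates(orfs, msd_scores, sgd_annotations):
    """Tally the three outcome classes for a list of candidate ORFs."""
    tally = {"hit": 0, "miss": 0, "putative": 0}
    for orf in orfs:
        tally[classify_candidate(orf, msd_scores, sgd_annotations)] += 1
    return tally
```

Because the MSD is consulted first, a gene such as HSL7 that the SGD annotates only as a kinase regulator is still counted as a hit when the MSD records it as a methyltransferase.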
There are a number of inconsistencies in the SGD that can lead to inaccurate scoring. For example, HSL7, GCD14, and HemK are still not annotated as methyltransferases in the SGD (although they are in the MSD). This reflects that some genes are annotated as part of a pathway or have a phenotype but that the role as a methyltransferase was not initially known; for example HSL7 (YBR133c) is annotated as a negative regulator of the SWE1 kinase, but experimental evidence has confirmed the prediction of HSL7 as a methyltransferase (19).
Profile Searches Using PSI-BLAST and HMMer
A compilation of protein sequences in SCOP 1.61 (astral.stanford.edu/) and the non-redundant SwissProt and TrEMBL databases (ftp://us.expasy.org/databases/sp_tr_nrdb/fasta/) was iteratively searched using the PSI-BLAST program (12). Each potential methyltransferase ORF sequence was used as the query with a profile inclusion E-value threshold of 0.001 and composition-based statistics turned on (20). The iterations were carried out for five rounds (or until convergence), and PSI-BLAST checkpoint files were saved for future use. The results of the searches were inspected after each iteration to ensure that no compositionally biased sequences or spurious matches were included in the profile. To increase the sensitivity in the second step, candidate sequences and their corresponding checkpoint files from the first step were used as inputs for PSI-BLAST to scan the yeast proteome (genome-www.stanford.edu/Saccharomyces/). These searches were run for one iteration with the E-value set at 1e-5 to account for the smaller size of the yeast proteome compared with the database used to construct the profile. Potential methyltransferase ORF sequences were also individually compared with the Pfam 8.0 database (pfam.wustl.edu/), a collection of profile hidden Markov models built from manually curated alignments of more than 5000 protein families (21). The searches employed the hmmpfam module of HMMer (13) (hmmer.wustl.edu/), and the E-value threshold was set at 1.
RESULTS
In addition to simple searches with the listed patterns, the sensitivity was increased by allowing errors (deviations from the prescribed pattern) to be introduced. As shown in Table V, part A, the number of results grows quickly as multiple deviations are allowed. However, the number of false positives (candidates that have a known non-methyltransferase function) also increases rapidly as deviations are allowed, suggesting that this approach is not a good one for identifying new methyltransferases. The large number of false positives arises because a best match at each position is accepted just as readily as a worst match at each position. For example, VLDVGCGPG is treated no differently than GSVTAAAVD; the latter would not be considered an acceptable Motif I based on known methyltransferase sequences.
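Allowing deviations can be sketched as a sliding-window search that counts mismatched positions. This is an illustration only: the pattern below is hypothetical (it is not one of the Table V patterns), and "x" is assumed to mean any residue.

```python
def count_deviations(positions, window):
    """Count positions where the residue is not among those the pattern allows."""
    return sum(
        1
        for allowed, residue in zip(positions, window)
        if allowed != "x" and residue not in allowed
    )


def search_with_deviations(positions, sequence, max_deviations=0):
    """Return (start, window, deviations) for windows within the allowed error count."""
    width = len(positions)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        deviations = count_deviations(positions, window)
        if deviations <= max_deviations:
            hits.append((start, window, deviations))
    return hits


# Hypothetical pattern, for illustration only.
EXAMPLE_PATTERN = ["VIL", "VIL", "DE", "VIL", "G", "CAS", "G", "x", "G"]
```

Every accepted window is reported with equal standing whatever its deviation count, which is the weakness noted above: the search has no notion of a better or worse match at a given position.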
In an attempt to reduce false positives, a restricted search motif was created by removing the unusual amino acids from the patterns (Table V, part C). The results from searching with the restricted motif sets are shown in Table V, part B. Although the initial number of matches is lower, the amount of information returned is similar (for a given number of results, the partitioning of the results into "correct," "incorrect," and unknown is similar to that seen in Table V, part A). It is clear from these results that there is a very low limit of information that can be derived from these types of canonical searches before the signal-to-noise ratio drops well below an acceptable limit.
Unsupervised Automatic Motif-based Searches Are Similar to Human-mediated BLAST Searches and Can Be Greatly Improved with Minor Parameter Modification
We then took a second approach to finding new methyltransferases using automated motif identification processes. To answer the question of how good default "out-of-the-box" motif identification and searching is, the 1064 RefSeq matches for the keyword search "methyltransferase" were BLASTed against themselves to return sets in which no entry was homologous to any other entry with a significance greater than a certain expect value. The two expect values used were 10^-20 and 10^-50, which returned training sets of 289 and 495 sequences, respectively. The motif-searching program MEME (17) was trained with the 10^-20 set without parameter modification and used in the default mode to detect five motifs. The matrices obtained were then used by the MAST program (18) to search a yeast-translated ORF database for matches (Table VI, MEME expect 10^-20, all motifs). Here, 9 methyltransferases were returned, with 5 false positives and 17 unknowns.
The automated MEME-MAST tool set in its default configuration created lists of putative methyltransferases similar to those obtained by hand using BLAST and manual sequence inspection (10). The advantage of the automated tools is that they involve less effort and can therefore be rapidly applied to other genomes. Additionally, the results returned by the MEME-MAST tool set were significantly improved over the manual method by performing a first-pass analysis of the results and rerunning the search after removing the non-Motif I confounding elements that were specific to only certain subclasses of methyltransferases or that may represent motifs for distinct types of enzymes such as the related NAD/NADP dehydrogenases.
It is worthwhile to note that a major difference between this method and the canonical method described above is that this method begins with a list of gene sequences, which are then ordered in terms of likelihood of each entry being a methyltransferase. Using default settings, only the top percentage of entries is returned. However, with reduced reporting stringency, the entire genome can be ordered by the likelihood of each ORF being a methyltransferase.
A Less Stringent Training Set Produces Slightly Improved Results When Combined with a Variably Distal Hand-coded Post I Motif: Sensitized Methyltransferase-scoring Matrices (SM2)
The MAST program returns a score for every hit that represents how well the subsequence of the ORF (motif) fits the MEME-derived scoring matrix. However, this score is not directly a probability that the gene product of a sequence is a methyltransferase. To compare sets of results returned from MAST, which vary in both order and motif match significance scores, we have arbitrarily chosen a cutoff point at the fifth known incorrect identification. Comparing the results of the 10^-20 and 10^-50 training sets yields very similar results (Table VI; MEME expect 10^-20, Motif I and MEME expect 10^-50, Motif I) with the 10^-20 results being slightly better (one additional positive match and two additional candidates); this is the more stringent of the two sets.
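The cutoff rule can be sketched as follows, assuming each ORF in the ranked MAST output has already been labeled "hit", "miss", or "putative". The fifth miss is counted as part of the reported list, consistent with the tallies quoted above that include five false positives.

```python
def summarize_until_nth_miss(ranked_labels, n_misses=5):
    """Tally labels from the top of a ranked list through the n-th known miss.

    ranked_labels: iterable of 'hit', 'miss', or 'putative', best score first.
    Returns a dict of counts; if fewer than n misses occur, the whole list is tallied.
    """
    counts = {"hit": 0, "putative": 0, "miss": 0}
    for label in ranked_labels:
        counts[label] += 1
        if label == "miss" and counts["miss"] == n_misses:
            break
    return counts
```

Two result sets can then be compared by their hit and putative counts at the same number of known misses, even when their raw MAST scores are not directly comparable.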
Noting the highly conserved, albeit degenerate, Post I motif, a set of hand-coded matrices describing the Post I motif was appended to the description of Motif I in an attempt to improve the search sensitivity. The set varied only in the number of score-neutral elements that separated the Motif I and Post I motifs. The two spacings considered were 5–25 and 10–30. The results are shown in Table VI (MEME, expect 10^-20, Motif I-[10–30]-Post I; MEME, expect 10^-20, Motif I-[5–25]-Post I; MEME, expect 10^-50, Motif I-[5–25]-Post I; and MEME, expect 10^-50, Motif I-[10–30]-Post I).
The results for all four sets are quite similar to one another and slightly improved over the non-Post I searches (14–16 correct identifications and 27–28 candidates). Although the ordering of the ORFs was different, the significance of the results was similar based on the number of correct identifications and number of candidates returned for the 5–25 and 10–30 spacings. The 10^-50 training set returned slightly better results than the 10^-20 training set, with two additional correct identifications and one additional candidate ORF.
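The idea of joining two motifs with a variable run of score-neutral columns can be sketched as below. The toy matrices, the default penalty of -1.0 for unlisted residues, and the function names are all assumptions for illustration; they are not the MEME-derived Motif I matrix, the hand-coded Post I matrices, or the MAST scoring scheme.

```python
def combined_matrices(motif1, post1, min_gap, max_gap):
    """Build one matrix per spacing: motif1 + `gap` neutral columns + post1.

    A neutral column is represented by None and contributes 0 to the score.
    """
    return [motif1 + [None] * gap + post1 for gap in range(min_gap, max_gap + 1)]


def column_score(column, residue):
    """Score one residue against one column (unlisted residues get -1.0 here)."""
    if column is None:
        return 0.0
    return column.get(residue, -1.0)


def best_score(sequence, matrix):
    """Best ungapped score of `matrix` over all windows, or None if too short."""
    width = len(matrix)
    if len(sequence) < width:
        return None
    return max(
        sum(column_score(c, r) for c, r in zip(matrix, sequence[s:s + width]))
        for s in range(len(sequence) - width + 1)
    )


def best_score_over_set(sequence, matrices):
    """Best score achieved by any matrix in the set (any permitted spacing)."""
    scores = [s for s in (best_score(sequence, m) for m in matrices) if s is not None]
    return max(scores) if scores else None


# Toy matrices, for illustration only.
TOY_MOTIF1 = [{"G": 2.0}, {"C": 1.0, "A": 1.0, "S": 1.0}, {"G": 2.0}]
TOY_POST1 = [{"D": 2.0, "E": 2.0}]
```

In this toy setup, a sequence whose two motifs are seven residues apart reaches the maximum score with the 5–25 set but not with the 10–30 set, which illustrates why the choice of spacing range can reorder candidates.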
We describe this optimized scoring system as sensitized matrices for scoring methyltransferases (SM2). The results from the best training set are expanded in Table VII, which represents our new best list of putative methyltransferases in yeast. Descriptions of all the currently known S. cerevisiae methyltransferases are shown in Table VIII.
As can be seen in Table VII, most candidates with percent correct values of 80 or greater pass as true positives according to PSI-BLAST criteria. Therefore, it appears that percent correct value 80 can be used in most cases as a safe threshold for automatic functional assignments. However, this analysis also showed that two candidates with lower percent correct values (YDR083w and YLR285w) are likely to be true positives, cautioning against the strict threshold. Finally, three ORFs not originally included in Table VII (YDR120c, YNL022c, and YBR141c) were identified as potential methyltransferases because other queries matched them at an E-value <1e-5. These proteins were subsequently used as queries with the non-redundant protein database and fulfilled the criteria outlined above for inclusion in the methyltransferase superfamily.
Sequence comparisons with HMMer tools and the Pfam 8.0 database provided further support for slightly more than half of the PSI-BLAST true positives but were ultimately less informative than the SM2 method described here, despite the fact that Pfam 8.0 contains HMMs for more than 30 methyltransferase families, including some families that are presently annotated as uncharacterized (Footnote 3). Although it is formally possible that some true positives from the SM2 and PSI-BLAST searches represent false predictions and as such were not confirmed by HMMer, it is clear that the coverage of the methyltransferase superfamily in Pfam 8.0 is far from reaching saturation.
SM2 Methodologies Are Easily Applied to Other Genomes and Show Results Similar to Those Seen in S. cerevisiae
To generalize these results, translated ORFs from six additional recently sequenced genomes (human, mouse, Drosophila, Caenorhabditis elegans, Arabidopsis, and Escherichia coli) were ordered based on likelihood of being a methyltransferase using the MAST tool in the SM2 configuration with the "expect 10^-50, Motif I-[10–30]-Post I" criterion described in Table VI. Lists of putative methyltransferases generated by this method are in the on-line supplement to this paper.
The methods developed here appear to have similar success in finding methyltransferases in these other genomes. The efficacy of ordering the genome in terms of likelihood of being a methyltransferase is shown graphically in Figs. 3 and 4. After the genome is ordered in this fashion, one can look at the genes of known function and develop an overall cumulative percent methyltransferases expression that is similar to the scoring methodology used earlier in this article and shown graphically in Fig. 3. A possibly more telling view is to look, at each point in the ordered genome, at the local percent methyltransferases, that is, the percent methyltransferases in a small window surrounding that position. Fig. 4 shows this graphically using a window size of 0.01% of the genome size. As can be seen, the percent likelihood of finding a methyltransferase rapidly falls off after the top scores in 2–3% of the genome are analyzed.
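The two views can be sketched as below. The sketch assumes the input is the ranked list restricted to genes of known function, each labeled "hit" or "miss", and it expresses the window as a number of ranks on either side of a position rather than as a percentage of genome size; both choices are simplifications for illustration.

```python
def cumulative_percent(labels):
    """Percent of hits among the top i entries, for i = 1..len(labels)."""
    percents = []
    hits = 0
    for i, label in enumerate(labels, start=1):
        if label == "hit":
            hits += 1
        percents.append(100.0 * hits / i)
    return percents


def local_percent(labels, half_window):
    """Percent of hits in a window of +/- half_window ranks around each position.

    Windows are truncated at the ends of the list.
    """
    percents = []
    for i in range(len(labels)):
        lo = max(0, i - half_window)
        hi = min(len(labels), i + half_window + 1)
        window = labels[lo:hi]
        percents.append(100.0 * sum(1 for l in window if l == "hit") / len(window))
    return percents
```

The cumulative curve smooths over the whole list above each rank, whereas the local curve shows where in the ranking the density of known methyltransferases actually drops.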
The final calculation in this section is the prediction of the total number of motif-bearing methyltransferases in a given genome. This calculation was done by taking the data from Fig. 3 and performing the following calculation for each point
[Equation not recoverable from the source.]
These data are plotted in Fig. 5. It is predicted from the graph that all the genomes assayed have a similar percentage (0.6–1.6%) of genes that are of the Class I motif form of methyltransferases.
DISCUSSION
Including an ORF in a list of putative methyltransferases is obviously only a first step toward biochemically characterizing a new AdoMet-dependent methyltransferase. Even if we had a perfect method that identified all the AdoMet-dependent genes in a genome, we would still need to determine what their methyl-accepting substrates were to define their biological function. As enzymatic activity specification is the slow step in this process, it is sufficient at this point to have a partial list with even marginal confidence that each entry in the list is a methyltransferase. Having a list of 100 ORFs where each entry is 50% likely to be a methyltransferase is much better than having an entire genome ORF list where each entry is only 1–2% likely to be a methyltransferase. As time progresses and these early lists are exhausted, better techniques will hopefully evolve for protein identification that will allow a complete catalog of the methyltransferase complement of an organism to be established.
In the end, only time will tell if we have, in fact, generated here "good" lists of candidate methyltransferases. We can say at this point, however, that our methodology does appear to be superior to that presently employed in a database such as Pfam (21). For example, of the 24 experimentally verified yeast methyltransferases described in Table VII, eight are not annotated as methyltransferases in version 8.0 of the Pfam database. Additionally, we note that the SM2 methodology used here has identified six new candidates in the "100%" region and 33 new candidates in the "100–42%" region of Table VII that were not detected in the 1999 analysis of yeast proteins (10). We have been pleased to see a steady progression of our best yeast candidates into the class of experimentally supported methyltransferases. For example, just in the time between the completion of this manuscript and its revision, two of our high scoring candidates were identified as specific methyltransferases (15, 16). Further evidence of this progress is that in 1999 only seven Class I methyltransferases had been described in yeast (10); the present number is 26 (Table VIII)! We note that the methods described here are only designed to reveal the Class I seven β-strand family of methyltransferases. Further work will be needed to analyze the Class II (SET) enzymes and the Class III (membrane-bound) enzymes. From the compilation in Table VIII of the 38 presently identified yeast methyltransferases, 26, or 68%, are of the Class I type.
Based on our results, it appears that we may have reached the limit of what is possible with the SM2 methodology presented. Doubling the training set had minimal effect on the results. When we included information from the motif Post I, we did increase the number of correct positive identifications but only marginally improved the number of candidate methyltransferases returned above the 5-false positive threshold used in this study. It is clear that SM2 may weakly score some methyltransferases (false negatives) because the motifs are divergent or because the spacing between them is different from the canonical spacing.
So how can these results be improved further? The next logical step would be the incorporation of countertraining sets, using the false positive results to create a feature set that could be recognized and used to downgrade ORFs that had similar features. For example, many of the false positives either fit into a class of enzymes that could be identified (e.g. dehydrogenases or nucleotide-binding proteins) or were highly homologous to one another and could be eliminated on that basis (e.g. the HXT proteins). Another avenue we are currently exploring is the use of motif-based profile HMMs that would automate functional assignments and provide more stringent statistical criteria for distinguishing true from false positives (Footnote 3). Despite these limitations, we now have a list of unidentified ORFs for which we are highly confident that a majority of the members will ultimately be characterized as methyltransferases.
FOOTNOTES
Published, MCP Papers in Press, July 18, 2003, DOI 10.1074/mcp.M300037-MCP200
1 The abbreviations used are: AdoMet, S-adenosylmethionine; MSD, methyltransferase-specific database; SM2, sensitized matrices for scoring methyltransferases; ORF, open reading frame; SGD, Saccharomyces Genome Database; SQL, structured query language.
2 Dolinski, K., Balakrishnan, R., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S. R., Fisk, D. G., Hong, E. L., Issel-Tarver, L., Sethuraman, A., Theesfeld, C. L., Binkley, G., Lane, C., Schroeder, M., Dong, S., Weng, S., Andrada, R., Botstein, D., and Cherry, J. M., Saccharomyces Genome Database at genome-www.stanford.edu/Saccharomyces/.
3 M. Dlakić, unpublished results.
* This work was supported by National Institutes of Health Grants GM26020 and AG18000 (to S. C.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
S The on-line version of this article (available at http://www.mcponline.org) contains Supplement I.
¶ Special Fellow of the Leukemia and Lymphoma Society.
|| To whom correspondence should be addressed: UCLA Molecular Biology Institute, 611 Charles E. Young Dr. East, Los Angeles, CA 90095-1570. Tel.: 310-825-8754; Fax: 310-825-1968; E-mail: clarke{at}mbi.ucla.edu
REFERENCES