Unité de Génétique Moléculaire Bactérienne, Institut Pasteur, 28 rue du Docteur Roux, 75724 Paris Cedex, France (1)
Annotation-Bases de Données (PT4), Génopole, Institut Pasteur, Paris, France (2)
Génoscope/UMR 8030, Atelier de Génomique Comparative, 2 rue Gaston Crémieux, 91006 Evry Cedex, France (3)
Author for correspondence: Stewart T. Cole. Tel: +33 1 45688446. Fax: +33 1 40613583. e-mail: stcole{at}pasteur.fr
ABSTRACT
Keywords: mycobacteria, tuberculosis, genomics
Abbreviations: CDS, protein-coding sequences
INTRODUCTION
Mycobacterium tuberculosis H37Rv was first isolated in 1905, has remained pathogenic and is the most widely used strain in tuberculosis research. The complete genome sequence and annotation of this strain were published in 1998 (Cole et al., 1998). The information from this project was incorporated into the public database TubercuList (http://genolist.pasteur.fr/TubercuList/), which was created using the GenoList model (Moszer et al., 2002).
In this paper we describe the re-annotation of the M. tuberculosis H37Rv genome. We have manually re-evaluated each of the previously annotated coding sequences (CDS) and present the combined results of recent database searches and literature surveys. This annotation also contains new comparisons with the recently completed genome sequence of Mycobacterium leprae (Cole et al., 2001).
METHODS
Identification of new CDS.
Three independent approaches were used to detect new potential coding sequences. In the first, some new CDS with appropriate GC content, correlation scores and codon usage were found manually during the re-annotation of the genome. In the second, a new program, AMIGA (Automatic MIcrobial Genome Annotation), was used to identify possible frameshifts and potential coding sequences that had been overlooked (for details see Bocs et al., 2002). Briefly, AMIGA found the most likely CDS longer than 60 bp and merged the results with those generated by a modified GeneMark analysis. The combined results were then compared with the original annotation, and the additional CDS detected by AMIGA were investigated further using the criteria of the first approach and database searches. In the third approach, other CDS were found following TBLASTN searches of TubercuList using protein sequence data from the literature (Jungblut et al., 2001; Rosenkrands et al., 2000a; Corixa Corporation patent no. WO 97/09428, 1997) or personal communications (N. Stoker, P. Jungblut).
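The kind of screen described above can be illustrated with a minimal sketch. It is not AMIGA and does not use correlation scores, codon usage or GeneMark; it only scans the six reading frames for open reading frames longer than 60 bp and reports their GC fraction. The start-codon set, the function names and the `Orf` record are assumptions made for illustration.

```python
from typing import Iterator, NamedTuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_LENGTH_BP = 60                    # length threshold quoted in the text


class Orf(NamedTuple):
    strand: str   # "+" or "-"
    start: int    # 0-based offset on the scanned strand
    end: int      # exclusive offset, stop codon included
    gc: float     # GC fraction of the ORF


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq: str, strand: str) -> Iterator[Orf]:
    """Yield ORFs longer than MIN_LENGTH_BP in the three frames of one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # keep the first in-frame start (longest ORF)
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start > MIN_LENGTH_BP:
                    yield Orf(strand, start, end, gc_fraction(seq[start:end]))
                start = None


def find_orfs(seq: str) -> list[Orf]:
    """Scan both strands; '-' coordinates refer to the reverse complement."""
    seq = seq.upper()
    return list(scan_strand(seq, "+")) + list(scan_strand(reverse_complement(seq), "-"))
```

Candidates from such a scan would still need the manual criteria and database searches described above before being accepted as CDS.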
RESULTS AND DISCUSSION
Using AMIGA, an in-house program employing M. tuberculosis codon usage (see Methods), we have found 30 other CDS whose putative products comprise between 45 and 211 aa (Table 1; see AMI). With these different methods, nine new CDS, located on the opposite strand, replaced genes identified in the original annotation. In these cases, the same Rv number has been kept but the letter extension c, denoting localization on the complementary strand, has been added or removed as appropriate. AMIGA also predicts real or potential frameshifts, and these results led to the correction of four sequencing errors. The current nucleotide sequence now contains 4411532 nt.
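The naming rule for CDS reassigned to the opposite strand can be sketched as follows. The tag pattern (Rv, four digits, an optional capital letter as in Rv2922A, then an optional c) is an assumption inferred from the identifiers quoted in this paper, and the function names are hypothetical.

```python
import re

# Assumed tag shape: "Rv" + four digits + optional capital letter + optional "c".
LOCUS_TAG = re.compile(r"^(Rv\d{4}[A-Z]?)(c?)$")


def is_complementary(tag: str) -> bool:
    """True if the tag carries the 'c' extension (complementary strand)."""
    match = LOCUS_TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognised locus tag: {tag!r}")
    return match.group(2) == "c"


def flip_strand_suffix(tag: str) -> str:
    """Keep the Rv number but add or remove the 'c' extension."""
    match = LOCUS_TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognised locus tag: {tag!r}")
    base, suffix = match.groups()
    return base if suffix == "c" else base + "c"
```

For instance, under this assumed pattern `flip_strand_suffix("Rv0382c")` gives `"Rv0382"`, and applying the function twice returns the original tag.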
Seven other CDS were identified using data from the literature. Thus, the proteomic study of Jungblut et al. (2001) identified five new proteins by two-dimensional electrophoresis and mass spectrometry. Two other CDS were uncovered using the results of antigen discovery programmes of the Corixa Corporation (Patent no. WO 97/09428, 1997) and the Statens Serum Institute (Rosenkrands et al., 2000a), respectively.
All new CDS have been classified into one of the 11 functional classes used by Cole et al. (1998) and described in Table 2. For 60 of the putative proteins we were unable to predict a precise function; this group includes proteins classified as conserved hypothetical proteins (class 10), unknown proteins without homology (class 8), proteins putatively involved in cell-wall processes or localized in the membrane (class 3) and two PE family proteins (class 6). Functions were predicted for 22 of the new proteins, e.g. RpmF or KdpF, as discussed previously, which were classified in classes 2 (information pathways) and 3 (cell-wall and cell processes), respectively.
In addition, during the new analysis, the lengths of 60 coding sequences have been altered based on the results of BLASTP/FASTA and MEME/MAST or due to correction of the nucleotide sequence (Table 1). Thus 35 CDS have been shortened and 15 extended at their 5' ends, whereas 10 have been altered at their 3' ends due to sequence corrections or errors in the original annotation. One gene, Rv2233, has been removed but not replaced, as it has been merged with the preceding CDS, Rv2232. When the coding sequence of a previously determined gene has been shortened, we have checked for the possible presence of a new CDS in another reading frame. So, for example, after shortening the smc gene, encoding a putative chromosome segregation protein, the acyP gene (Rv2922A) was identified.
Changes to functional classes
The functional classification of 643 of the predicted proteins of M. tuberculosis H37Rv identified in 1998 has been changed during the update (Fig. 1a). Not unexpectedly, the two functional classes exhibiting the greatest number of transfers to other classes were the unknown category with 354 changes (class 8) and the conserved hypotheticals category with 183 changes (class 10). Fig. 1(b, c) shows the functional categories to which the protein-coding genes annotated as unknown and conserved hypothetical proteins in 1998 have moved.
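As a quick consistency check, the 354 and 183 transfers quoted above, together with the six smaller groups itemized later in this section for Fig. 1(a), should account for all 643 changes. The short tally below uses only counts reported in the text; the variable names are illustrative.

```python
TOTAL_RECLASSIFIED = 643  # total number of class changes reported

# Transfers out of each 1998 functional class, keyed by class number.
transfers_out = {
    8: 354,   # unknown
    10: 183,  # conserved hypotheticals
    7: 63,    # intermediary metabolism and respiration
    3: 24,    # cell wall and cell processes
    9: 11,    # regulatory proteins
    1: 6,     # lipid metabolism
    0: 1,     # virulence, detoxification or adaptation
    2: 1,     # information pathways
}

tally = sum(transfers_out.values())

# Share of all changes that originate in classes 8 and 10.
share_unknown_or_conserved = (transfers_out[8] + transfers_out[10]) / tally

print(tally == TOTAL_RECLASSIFIED)
print(f"{share_unknown_or_conserved:.1%} of changes come from classes 8 and 10")
```

The counts sum to the reported total, and classes 8 and 10 together contribute 537 of the 643 changes.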
Many of the class 10 and class 8 proteins have been reclassified in the cell-wall and cell processes category (class 3) (Fig. 1b, c). There are two reasons for these changes. First, the criteria for classification in this group (class 3) have been amended since 1998. Class 3 now comprises all predicted membrane proteins or proteins believed to be involved in a cell process (including secreted and transmembrane proteins), regardless of whether they have similarities with other proteins or a predicted function. For example, in 1998, Rv0970, encoding an integral membrane protein with no similarity to other proteins, was classified as unknown (class 8). Second, some of the predicted proteins have moved to the cell-wall and cell processes group because of new information from the databases or new data from research. For example, Rv2450c is believed to encode a bacterial growth factor or cytokine involved in promoting the resuscitation and growth of dormant cells (Mukamolova et al., 1998) and has been renamed rpfE (M. Young, personal communication).
There have also been transfers from the other functional groups, as follows (Fig. 1a): 63 transfers from class 7 (intermediary metabolism and respiration), 24 transfers from class 3, 11 transfers from class 9 (regulatory proteins), 6 transfers from class 1 (lipid metabolism), 1 transfer from class 0 (virulence, detoxification or adaptation) and 1 transfer from class 2 (information pathways). These transfers often involve a change from a predicted function in one class to a more precise function in another (e.g. 7 to 1), but can also involve a regression. For example, Rv3522 was predicted to be a transcriptional regulatory protein in the first annotation (class 9), but re-analysis of its sequence leads us to consider it to be involved in lipid metabolism (class 1). The total number of regressions was 82, the largest group involving transfer from class 7 to 10. The original annotation of these putative proteins reflected either incorrect analysis of the results or errors in the databases used. For example, the CDS Rv0382c was originally annotated as a probable uridine 5'-monophosphate synthase based on the best similarities after BLASTP or FASTA analysis. However, this assignment is misleading, as uridine 5'-monophosphate synthase is a bifunctional enzyme containing orotate phosphoribosyltransferase activity (EC 2.4.2.10) at its N terminus and orotidine 5'-phosphate decarboxylase activity (EC 4.1.1.23) at its C terminus. Several proteins annotated as uridine 5'-monophosphate synthases in the databases share similarity only at their N terminus, and these are therefore incorrectly annotated because they lack the second domain found in authentic uridine 5'-monophosphate synthases. We thus consider the Rv0382c product to be an orotate phosphoribosyltransferase and not a uridine 5'-monophosphate synthase as predicted in the first annotation. Note that in certain regressions, gene names attributed in 1998 have been removed.
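The reasoning applied to Rv0382c can be expressed as a simple domain-architecture rule: a product name should be assigned only when every domain that defines the enzyme is present. The sketch below is illustrative and is not the procedure used for the re-annotation; the `DomainHit` record, the `annotate` function and the example coordinates are hypothetical.

```python
from dataclasses import dataclass

OPRT = "orotate phosphoribosyltransferase"      # EC 2.4.2.10, N-terminal in UMP synthase
OMPDC = "orotidine 5'-phosphate decarboxylase"  # EC 4.1.1.23, C-terminal in UMP synthase
UMPS = "uridine 5'-monophosphate synthase"      # bifunctional: OPRT followed by OMPDC


@dataclass(frozen=True)
class DomainHit:
    name: str   # domain detected by similarity search
    start: int  # first matched residue in the query protein
    end: int    # last matched residue in the query protein


def annotate(hits: list[DomainHit]) -> str:
    """Name the product from its domain content rather than from the best hit."""
    oprt = [h for h in hits if h.name == OPRT]
    ompdc = [h for h in hits if h.name == OMPDC]
    # Both domains, in N-to-C order, are required for the bifunctional enzyme.
    if oprt and ompdc and min(h.start for h in oprt) < min(h.start for h in ompdc):
        return UMPS
    if oprt and not ompdc:
        return OPRT
    if ompdc and not oprt:
        return OMPDC
    return "unresolved"


# Hypothetical hit: similarity restricted to the N-terminal domain only.
print(annotate([DomainHit(OPRT, 5, 180)]))
```

A protein matching a bifunctional database entry only over the N-terminal domain is therefore named after that single activity, which mirrors the conclusion reached for Rv0382c.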
Changes within functional classes
Updating the genome annotation of M. tuberculosis H37Rv has also resulted in many changes within the functional classes, usually due to new information from the literature. These have included updating the product names, changing 58 specific gene names and introducing appropriate new citations. Gene names were included when there was significant similarity or a pertinent publication. For example, the previously annotated umaA2 (Rv0470c), unknown mycolic acid synthase, has recently been changed to pcaA, encoding an S-adenosyl methionine (SAM)-dependent methyl transferase required for α-mycolic acid cyclopropanation and lethal chronic persistence in M. tuberculosis infection (Glickman et al., 2000). On occasion, a gene name has been given to a CDS which previously had just an Rv number, e.g. Rv0981 is identified now as mpr (Zahrt & Deretic, 2001).
Changes have also been generated from new sequences in the databases and from detailed studies of synteny, particularly with the related pathogen M. leprae. For example, Rv1860 was previously known as apa, encoding a 45/47 kDa secreted antigen (Laqueyrerie et al., 1995), located at the end of the modABC operon (molybdate transport system). Its name was recently changed to modD, on the basis of its proximity to modABC, and it is now part of the ModD family described by SWISS-PROT (e.g. P46842, Q50906). However, the protein, which has been shown to be glycosylated and to have fibronectin-binding activity (Schorey et al., 1995), shares no significant sequence similarity with other proteins involved in molybdate uptake. Furthermore, all of the functions involved in molybdate transport, and the enzymes that synthesize or require molybdopterin for activity, have been inactivated or lost from M. leprae (Eiglmeier et al., 2001), with the exception of modD. This strongly suggests that ModD does not participate in molybdate uptake and we propose, therefore, that the name apa should be maintained.
Functional distribution of predicted CDS in 1998 and 2002
The new genomic annotation of M. tuberculosis H37Rv has incorporated many changes to the functional classifications of the predicted proteins. A comparison of the number of predicted proteins in each of the functional categories between 1998 and 2002 is shown in Table 2. An important change has been the decrease in the number of unknown proteins from 606 to 272. Presently we are able to predict a function for 2058 proteins (52% of the proteome), and more than 150 of these have been experimentally proven in mycobacterial research. The number of conserved hypothetical proteins has changed from 910 in 1998 to 1051 today. A total of 376 putative proteins show no similarity to known proteins from other organisms, and some of them may be specific to M. tuberculosis. To date, more than 400 M. tuberculosis proteins have been detected experimentally, most of them by proteomic studies (Weldingh et al., 1998; Jungblut et al., 1999; Mollenkopf et al., 1999; Rosenkrands et al., 2000b; Betts et al., 2000). In the coming years, the number of unknown proteins should continue to decrease as more similarities are found by database searches or as functions are identified for some of these potentially M. tuberculosis-specific proteins. The structural genomics programmes currently under way on mycobacteria should have a significant impact in this respect (http://www.doe-mbi.ucla.edu/TB/; http://www.pasteur.fr/recherche/X-TB/).
ACKNOWLEDGEMENTS
REFERENCES
Bentley, S. D., Chater, K. F., Cerdeno-Tarraga, A. M. & 40 other authors (2002). Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature 417, 141-147.
Betts, J. C., Dodson, P., Quan, S., Lewis, A. P., Thomas, P. J., Duncan, K. & McAdam, R. A. (2000). Comparison of the proteome of the Mycobacterium tuberculosis strain H37Rv with clinical isolate CDC1551. Microbiology 146, 3205-3216.
Bocs, S., Danchin, A. & Medigue, C. (2002). Re-annotation of genome microbial CoDing-Sequences: finding new genes and inaccurately annotated genes. BMC Bioinformatics 3, 1-5.
Braibant, M., Gilot, P. & Content, J. (2000). The ATP binding cassette (ABC) transport systems of Mycobacterium tuberculosis. FEMS Microbiol Rev 24, 449-467.
Cole, S. T., Brosch, R., Parkhill, J. & 39 other authors (1998). Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537-544.
Cole, S. T., Eiglmeier, K., Parkhill, J. & 40 other authors (2001). Massive gene decay in the leprosy bacillus. Nature 409, 1007-1011.
Dandekar, T., Huynen, M., Regula, J. T. & 10 other authors (2000). Re-annotating the Mycoplasma pneumoniae genome sequence: adding value, function and reading frames. Nucleic Acids Res 28, 3278-3288.
Eiglmeier, K., Parkhill, J., Honore, N. & 12 other authors (2001). The decaying genome of Mycobacterium leprae. Lepr Rev 72, 387-398.
Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C. J., Hofmann, K. & Bairoch, A. (2002). The PROSITE database, its status in 2002. Nucleic Acids Res 30, 235-238.
Fleischmann, R. D., Adams, M. D., White, O. & 37 other authors (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496-512.
Gaasterland, T. & Oprea, M. (2001). Whole-genome analysis: annotations and updates. Curr Opin Struct Biol 11, 377-381.
Gassel, M., Mollenkamp, T., Puppe, W. & Altendorf, K. (1999). The KdpF subunit is part of the K(+)-translocating Kdp complex of Escherichia coli and is responsible for stabilization of the complex in vitro. J Biol Chem 274, 37901-37907.
Glickman, M. S., Cox, J. S. & Jacobs, W. R., Jr (2000). A novel mycolic acid cyclopropane synthetase is required for cording, persistence, and virulence of Mycobacterium tuberculosis. Mol Cell 5, 717-727.
Jungblut, P. R., Schaible, U. E., Mollenkopf, H. J. & 7 other authors (1999). Comparative proteome analysis of Mycobacterium tuberculosis and Mycobacterium bovis BCG strains: towards functional genomics of microbial pathogens. Mol Microbiol 33, 1103-1117.
Jungblut, P. R., Muller, E. C., Mattow, J. & Kaufmann, S. H. (2001). Proteomics reveals open reading frames in Mycobacterium tuberculosis H37Rv not predicted by genomics. Infect Immun 69, 5905-5907.
Kersten, M. A., Muller, Y., Baars, J. J., Op den Camp, H. J., van der Drift, C., Van Griensven, L. J., Visser, J. & Schaap, P. J. (1999). NAD+-dependent glutamate dehydrogenase of the edible mushroom Agaricus bisporus: biochemical and molecular characterization. Mol Gen Genet 261, 452-462.
Laqueyrerie, A., Militzer, P., Romain, F., Eiglmeier, K., Cole, S. T. & Marchal, G. (1995). Cloning, sequencing, and expression of the apa gene coding for the Mycobacterium tuberculosis 45/47-kilodalton secreted antigen complex. Infect Immun 63, 4003-4010.
Lu, C. D. & Abdelal, A. T. (2001). The gdhB gene of Pseudomonas aeruginosa encodes an arginine-inducible NAD(+)-dependent glutamate dehydrogenase which is subject to allosteric regulation. J Bacteriol 183, 490-499.
Minambres, B., Olivera, E. R., Jensen, R. A. & Luengo, J. M. (2000). A new class of glutamate dehydrogenases (GDH). Biochemical and genetic characterization of the first member, the AMP-requiring NAD-specific GDH of Streptomyces clavuligerus. J Biol Chem 275, 39529-39542.
Mollenkopf, H. J., Jungblut, P. R., Raupach, B., Mattow, J., Lamer, S., Zimny-Arndt, U., Schaible, U. E. & Kaufmann, S. H. (1999). A dynamic two-dimensional polyacrylamide gel electrophoresis database: the mycobacterial proteome via Internet. Electrophoresis 20, 2172-2180.
Moszer, I., Jones, L. M., Moreira, S., Fabry, C. & Danchin, A. (2002). SubtiList: the reference database for the Bacillus subtilis genome. Nucleic Acids Res 30, 62-65.
Mukamolova, G. V., Kaprelyants, A. S., Young, D. I., Young, M. & Kell, D. B. (1998). A bacterial cytokine. Proc Natl Acad Sci USA 95, 8916-8921.
Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 10, 1-6.
Paulsen, I. T., Nguyen, L., Sliwinski, M. K., Rabus, R. & Saier, M. H., Jr (2000). Microbial genome analyses: comparative transport capabilities in eighteen prokaryotes. J Mol Biol 301, 75-100.
Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85, 2444-2448.
Riley, M. (1993). Functions of the gene products of Escherichia coli. Microbiol Rev 57, 862-952.
Rosenkrands, I., Weldingh, K., Jacobsen, S., Hansen, C. V., Florio, W., Gianetri, I. & Andersen, P. (2000a). Mapping and identification of Mycobacterium tuberculosis proteins by two-dimensional gel electrophoresis, microsequencing and immunodetection. Electrophoresis 21, 935-948.
Rosenkrands, I., King, A., Weldingh, K., Moniatte, M., Moertz, E. & Andersen, P. (2000b). Towards the proteome of Mycobacterium tuberculosis. Electrophoresis 21, 3740-3756.
Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M. A. & Barrell, B. (2000). Artemis: sequence visualization and annotation. Bioinformatics 16, 944-945.
Schorey, J. S., Li, Q., McCourt, D. W., Bong-Mastek, M., Clark-Curtiss, J. E., Ratliff, T. L. & Brown, E. J. (1995). A Mycobacterium leprae gene encoding a fibronectin binding protein is used for efficient invasion of epithelial cells and Schwann cells. Infect Immun 63, 2652-2657.
Serres, M. H., Gopal, S., Nahum, L. A., Liang, P., Gaasterland, T. & Riley, M. (2001). A functional update of the Escherichia coli K-12 genome. Genome Biol 2, 0035.1-0035.7.
Sonnhammer, E. L., von Heijne, G. & Krogh, A. (1998). A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol 6, 175-182.
Tekaia, F., Gordon, S. V., Garnier, T., Brosch, R., Barrell, B. G. & Cole, S. T. (1999). Analysis of the proteome of Mycobacterium tuberculosis in silico. Tuber Lung Dis 79, 329-342.
Weldingh, K., Rosenkrands, I., Jacobsen, S., Rasmussen, P. B., Elhay, M. J. & Andersen, P. (1998). Two-dimensional electrophoresis for analysis of Mycobacterium tuberculosis culture filtrate and purification and characterization of six novel proteins. Infect Immun 66, 3492-3500.
Zahrt, T. C. & Deretic, V. (2001). Mycobacterium tuberculosis signal transduction system required for persistent infections. Proc Natl Acad Sci USA 98, 12706-12711.
Received 2 April 2002; revised 27 June 2002; accepted 18 July 2002.