Rational structural genomics: affirmative action for ORFans and the growth in our structural knowledge

Daniel Fischer1

Faculty of Natural Science, Department of Mathematics and Computer Science, Ben Gurion University, Beer-Sheva 84015, Israel


    Introduction
 
The determination of the complete genome sequences of organisms is producing an avalanche of protein sequences awaiting further structural and functional interpretation. Only a small fraction of the proteins encoded in these genomes has been experimentally studied, but putative functions for roughly 70% of the ORFs can be assigned via homology with characterized proteins in the databases. Similarly, although only a very small number of structures have been determined for these proteins, putative three-dimensional (3D) structures can currently be assigned to roughly 30% of the ORFs using fold assignment computational methods. Here I address the following questions. How fast is our structural knowledge growing? What is the distribution of assigned folds in the different functional categories? How might structure determination efforts be prioritized for maximum information and impact?

I have analyzed the 3D fold assignments for the genome of Mycoplasma genitalium (Fraser et al., 1995), which due to its small size has served as a minimal model organism for various studies. Several publications have reported different fractions of the genome for which 3D folds can be assigned. The earliest works reported fractions as low as 9 and 12% (Casari et al., 1996; Frishman and Mewes, 1997; Gerstein, 1997). Later works using methods aimed at detecting more distant relationships increased this fraction to 25% (Fischer and Eisenberg, 1997) and, more recently, to around 40% (Huynen et al., 1998; Rychlewski et al., 1998; Teichmann et al., 1998; Jones, 1999; Wolf et al., 1999 and others; for recent reviews on this topic see Fischer and Eisenberg, 1999a; Teichmann et al., 1999). The differences in the reported fractions depend mainly on (i) the methods' sensitivities (the rate of true positives) and their selectivities (the rate of false positives); (ii) whether assignments are counted only for full structural domain matches or also for small sequence–structure segments; and (iii) the date on which the study was done (which determines the number of known sequences and structures and hence the number of sequences that can be assigned to known folds).

To evaluate how much the increase in the fraction of assignable ORFs depends on the number of available folds, I have compared the fold assignments of M.genitalium proteins obtained by one particular method using three different sets of structures. The method used in this comparison (Fischer and Eisenberg, 1997) is aimed at detecting full structural domain matches and uses rather conservative thresholds (the choice of method for this comparison is not critical; qualitatively similar results are likely to be obtained with any other method). When using only those structures available before 1996, only 20% of the genome could be assigned a fold. With structures from the PDB available in April 1997, 25% of the genome was assigned a fold (Fischer and Eisenberg, 1997). When using all the structures available in October 1998, the fraction of assigned proteins reached 32% (see http://www.doe-mbi.ucla.edu/people/frsvr/preds/MG/MG.html). This indicates that, because of the availability of more structures, the fraction of assignable ORFs has increased at an annual rate of roughly 18% (Fischer and Eisenberg, 1999a; see also Teichmann et al., 1999 and references therein).
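The annual rate quoted above can be checked with a short calculation. The sketch below is only an illustration, not the procedure used in the cited works: it treats the rate as a compound (relative) yearly increase, and it assumes decimal-year dates for the three structure sets (1996.00 for 'before 1996', 1997.25 for April 1997 and 1998.75 for October 1998). The function name `compound_annual_rate` and the variable names are hypothetical.

```python
def compound_annual_rate(start, end, years):
    """Return the constant yearly relative growth that takes `start` to `end`."""
    if start <= 0 or years <= 0:
        raise ValueError("start and years must be positive")
    return (end / start) ** (1.0 / years) - 1.0


# Assigned fractions are the ones quoted in the text.
# The decimal-year dates are assumptions made for this sketch.
snapshots = [
    (1996.00, 0.20),  # structures available before 1996
    (1997.25, 0.25),  # PDB structures available in April 1997
    (1998.75, 0.32),  # structures available in October 1998
]

# Overall rate between the first and the last snapshot.
(first_year, first_frac), (last_year, last_frac) = snapshots[0], snapshots[-1]
rate = compound_annual_rate(first_frac, last_frac, last_year - first_year)
print(f"overall: {rate:.1%} per year")

# Rate within each consecutive interval, to see whether growth is steady.
for (y0, f0), (y1, f1) in zip(snapshots, snapshots[1:]):
    interval_rate = compound_annual_rate(f0, f1, y1 - y0)
    print(f"{y0:.2f} -> {y1:.2f}: {interval_rate:.1%} per year")
```

Under these date assumptions the overall figure and both interval figures come out close to the roughly 18% stated in the text. Read as an absolute gain, the same data correspond to only about four to five percentage points of the genome per year, so the 18% is best understood as relative growth.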

Will the rate of increase in fold assignment be sustained throughout the next few years? To address this question, I have analyzed the distribution of the fold assignments of M.genitalium among the various functional categories described by Fraser et al. (1995). Table I shows that the three categories with the largest percentages of folds assigned are purine metabolism, energy metabolism and translation-tRNA. For example, all but two ORFs in the first category have been assigned a fold. As expected, and mostly due to the difficulties in determining the structures of membrane proteins, the three least covered categories are cell envelope, unknown and transport. The last column in Table I shows that the largest numbers of non-membrane proteins with no assigned fold belong to the unknown and ribosomal categories (ORFs characterized as membranal or with putative transmembrane helices were excluded).


 
Table I. Distribution of the fold assignment of M.genitalium according to functional categories
 
The fraction of assignable ORFs will undoubtedly continue to grow in the next few years, because new structures will continue to be determined in most of the functional categories. However, because in several functional categories only a few ORFs lack structural assignments, the fraction of assignable ORFs will soon reach a plateau if structure determination continues to concentrate on the best represented categories. A 'rational' approach to structural genomics (Fischer and Eisenberg, 1997; Kim, 1997; Rost, 1998; Teichmann et al., 1999) could significantly advance our knowledge by selecting for structure determination those proteins in the categories with fewer assigned structures. In total, 153 ORFs (32% of the genome) belong to the unknown category, of which 97 [21% of the genome or 43% (97/224) of the unassigned ORFs] correspond to soluble proteins with no functional or structural information whatsoever. Roughly half of them match proteins of unknown function from other organisms, indicating that they are conserved across various organisms. The other half of the ORFs in the 'unknown' category show no sequence similarity to any protein of other organisms (excluding the close relative M.pneumoniae). If these orphan ORFs, or ORFans for short (Fischer and Eisenberg, 1999b), code for expressed proteins (Dujon et al., 1994; Goffeau et al., 1996), they will correspond to unique proteins with novel functions or to very distant members of known families. Thus, these ORFans are likely to be among the most interesting targets for further structural and functional studies. Characterizing ORFans (Fischer and Eisenberg, 1999b) and conserved proteins of unknown function (Zarembinski et al., 1998) will be essential to fully understand the genetic material. Knowing their structures will contribute considerably to our understanding of protein structure, function and evolution.
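The prioritization idea in this paragraph can be made concrete with a small ranking sketch. It is only one possible reading of the 'rational' strategy, not a procedure from the cited works: categories are ordered by the number of soluble (non-membrane) ORFs that still lack a fold assignment, which is the quantity reported in the last column of Table I. The names `Category`, `soluble_unassigned` and `rank_targets` are hypothetical, and the counts in `example` are invented placeholders, not the values of Table I.

```python
from typing import NamedTuple


class Category(NamedTuple):
    name: str
    total_orfs: int           # all ORFs in the functional category
    assigned: int             # ORFs with an assigned fold
    membrane_unassigned: int  # unassigned ORFs that are membranal or have
                              # putative transmembrane helices


def soluble_unassigned(cat):
    """Number of non-membrane ORFs in the category with no assigned fold."""
    return cat.total_orfs - cat.assigned - cat.membrane_unassigned


def rank_targets(categories):
    """Order categories so that those with the most soluble, unassigned
    ORFs (the most promising structure determination targets) come first."""
    return sorted(categories, key=soluble_unassigned, reverse=True)


# HYPOTHETICAL counts, for illustration only (not taken from Table I).
example = [
    Category("category A", total_orfs=20, assigned=18, membrane_unassigned=0),
    Category("category B", total_orfs=50, assigned=10, membrane_unassigned=30),
    Category("category C", total_orfs=100, assigned=5, membrane_unassigned=20),
]

for cat in rank_targets(example):
    print(f"{cat.name}: {soluble_unassigned(cat)} soluble ORFs without a fold")
```

With the real counts from Table I substituted for the placeholders, well covered categories such as purine metabolism would fall to the bottom of such a list, while the unknown category, including the ORFans, would rise to the top.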


    Notes
 
1 To whom correspondence should be addressed; email: dfischer@cs.bgu.ac.il


    References
 
Casari,G., Ouzounis,C., Valencia,A. and Sander,C. (1996) GeneQuiz II: Automatic Function Assignment for Genome Sequence Analysis. In First Annual Pacific Symposium on Biocomputing. World Scientific, Hawaii, pp. 707–709.

Dujon,B. et al. (1994) Nature, 369, 371–377.

Fischer,D. and Eisenberg,D. (1997) Proc. Natl Acad. Sci. USA, 94, 11929–11934.

Fischer,D. and Eisenberg,D. (1999a) Curr. Opin. Struct. Biol., 9, 208–211.

Fischer,D. and Eisenberg,D. (1999b) Bioinformatics, 15, 759–762.

Fraser,C. et al. (1995) Science, 270, 397–403.

Frishman,D. and Mewes,H.-W. (1997) Nature Struct. Biol., 4, 626–628.

Gerstein,M. (1997) J. Mol. Biol., 274, 562–576.

Goffeau,A. et al. (1996) Science, 274, 546–547.

Huynen,M., Doerks,T., Eisenhaber,F., Orengo,C., Sunyaev,S., Yuan,Y. and Bork,P. (1998) J. Mol. Biol., 280, 323–326.

Jones,D. (1999) J. Mol. Biol., 287, 797–815.

Kim,S.H. (1997) Nature Struct. Biol., 5, 643–645.

Rost,B. (1998) Structure, 6, 259–263.

Rychlewski,L., Zhang,B. and Godzik,A. (1998) Folding Des., 3, 229–236.

Teichmann,S., Park,J. and Chothia,C. (1998) Proc. Natl Acad. Sci. USA, 95, ???-???.

Teichmann,S., Chothia,C. and Gerstein,M. (1999) Curr. Opin. Struct. Biol., 9, 390–399.

Wolf,Y., Brenner,S., Bash,P. and Koonin,E. (1999) Genome Res., 9, 17–26.

Zarembinski,T. et al. (1998) Proc. Natl Acad. Sci. USA, 95, 15189–15193.

Received June 28, 1999; revised September 9, 1999; accepted September 9, 1999.