Correspondence to: Mark E. Fortini, Department of Genetics, University of Pennsylvania School of Medicine, Stellar-Chance Laboratories, Room 709C, 422 Curie Boulevard, Philadelphia, PA 19104. Tel: (215) 573-6446. Fax: (215) 573-9411. E-mail: fortini{at}mail.med.upenn.edu.
The recent sequencing of the Drosophila genome as a collaborative effort between the Berkeley Drosophila Genome Project (BDGP)1 and Celera Genomics provides an unparalleled opportunity to assess the prevalence of human disease gene counterparts in the fly genome (
Survey Design and Methodology
Constructing the Human Disease Gene List
The core component of our survey is a list of 287 human disease genes representing several different classes of diseases, including cancer, neurological diseases, cardiovascular diseases, malformation syndromes, hematological, immune, endocrine, renal, and metabolic disorders (Table 1). This list was compiled by scanning the Online Mendelian Inheritance in Man database (OMIM; http://www.ncbi.nlm.nih.gov/omim/) as well as medical textbooks and review articles listing classes of human disease genes. The criterion for inclusion in the final list was that the human gene must actually be mutated, altered, amplified, or deleted in human subjects with the disease. From our initial set of >800 human genes associated with diseases, over half were eliminated because they did not meet this criterion. Genes potentially linked to a human disease solely by cell culture experiments, yeast two-hybrid interaction screens, model organism studies, or similar approaches were excluded from our analysis. Each human disease gene on the final list was confirmed by checking OMIM or published literature sources, and was placed in the most relevant disease category on the list. For human disease genes in which different paralogs have been associated with disease, such as Ras family members, rhodopsins, and some HOX and PAX gene family members, a single example was chosen to represent the group and redundant paralogs were eliminated from the list. In some cases, assignment of a gene to a particular category was somewhat arbitrary, since altered gene function may result in different diseases or a syndrome characterized by multiple organ involvement or a complex pathophysiology. For example, human Notch gene mutations cause both cancer (Notch1 rearrangements in T cell acute lymphoblastic leukemia) and neurological disease (Notch3 point mutations in CADASIL). 
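The two selection rules described above (patient-level evidence, and one representative per paralog family) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Candidate` record, its field names, the evidence labels, and the `GENE_X` entry are hypothetical and are not taken from the survey's actual working files.

```python
from dataclasses import dataclass

# Evidence labels that satisfy the inclusion criterion: the gene itself is
# mutated, altered, amplified, or deleted in human subjects with the disease.
PATIENT_EVIDENCE = {"mutated", "altered", "amplified", "deleted"}


@dataclass(frozen=True)
class Candidate:
    gene: str       # gene symbol
    family: str     # paralog family label (hypothetical field)
    evidence: str   # how the gene was linked to the disease
    category: str   # disease category on the list


def build_disease_list(candidates):
    """Keep patient-supported genes, one representative per paralog family."""
    selected = {}
    for c in candidates:
        if c.evidence not in PATIENT_EVIDENCE:
            continue  # e.g. linked only by cell culture or two-hybrid data
        # The first qualifying member of a family represents the whole group.
        selected.setdefault(c.family, c)
    return list(selected.values())


candidates = [
    Candidate("HRAS", "Ras", "mutated", "cancer"),
    Candidate("KRAS", "Ras", "mutated", "cancer"),          # redundant paralog
    Candidate("GENE_X", "X", "two-hybrid", "cancer"),       # hypothetical entry
    Candidate("HPRT1", "HPRT", "mutated", "metabolic"),
]
final_list = build_disease_list(candidates)
print([c.gene for c in final_list])
```

In the survey itself the choice of representative paralog and disease category was made by hand; the sketch simply takes the first qualifying entry.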
The final list of 287 human disease genes is not meant to be comprehensive; in fact, there are currently estimated to be 1,000 human disease genes defined by at least one allelic variant each (
Bulk BLAST Searches and Analysis
The initial step in the survey was to perform bulk BLASTP searches in which the 287 human disease protein sequences were used as queries to search a database consisting of all predicted proteins encoded by the yeast, worm, and fly genomes (38,860 total sequences;
These BLAST searches provided us with an initial set of BLAST scores and amino acid sequence alignments of each human disease protein to the top five matches in each of the three fully sequenced genomes. To assess the likelihood that a given human disease protein has a counterpart encoded in the fly genome, each set of alignments was visually inspected and compared with one another across the different species. Since the bulk BLAST searches were done with an aggregate set of 38,860 target sequences representing all yeast, worm, and fly proteins, using a normalized database size setting equal to the largest possible proteome size for all three genomes combined (the z parameter in BLAST, derived by adding the nucleotide lengths of all the genomes and dividing by three), the resulting E values in the different organisms could be compared directly despite their nonequivalent genome sizes. Cross-species inspection of the best alignments for each query protein aided our judgments by revealing whether specific protein domains tend to be highly conserved in all species, whether the fly possesses a much better match to the query than do yeast and worm, and whether a clear pattern of conserved residues can be discerned in the best alignments from the different species. In addition, the overall domain compositions of some fly proteins were compared with their human query proteins using InterPro, a database of protein sequence motifs (
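The effect of a shared database size on E values can be illustrated with the standard bit-score relation E = m · n · 2^(-S'), where m is the query length, n the database length in residues, and S' the bit score. The sketch below is a simplified illustration, not the survey's pipeline: the genome lengths are round placeholder numbers, the bit score and query length are arbitrary, and the edge-effect corrections that BLAST applies to m and n are ignored.

```python
def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S'), the Karlin-Altschul relation in bit-score form."""
    return query_len * db_len * 2.0 ** (-bit_score)


def normalized_db_len(genome_lengths_nt):
    """Upper bound on proteome size in residues: total nucleotides / 3."""
    return sum(genome_lengths_nt) // 3


# Placeholder genome lengths in nucleotides (illustrative, not the survey's values).
genomes = {"yeast": 12_000_000, "worm": 100_000_000, "fly": 120_000_000}
z = normalized_db_len(genomes.values())

bit_score, query_len = 60.0, 400  # arbitrary example alignment

# With per-organism database sizes, the same bit score yields different E values.
per_db = {org: expect_value(bit_score, query_len, nt // 3)
          for org, nt in genomes.items()}

# With one shared size z, a given bit score maps to one E value in every organism.
shared = expect_value(bit_score, query_len, z)

for org, e in per_db.items():
    print(f"{org}: per-database E = {e:.2e}, shared-z E = {shared:.2e}")
```

Because n is held constant, differences in E value between a yeast, worm, and fly hit then reflect only differences in alignment score, which is what makes the cross-species comparison meaningful.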
Evaluation of Questionable Cases
The above strategy allowed ~200 of the 287 genes on the list to be scored as present or absent in the fly genome in a relatively straightforward manner. For ~90 genes, however, the analysis was more problematic. In some of these cases, the query protein sequence was very short or the level of amino acid identity in the best BLAST alignments was low, resulting in a poor E value. A particularly striking example is the fly p53 gene, which exhibits a low degree of similarity to human p53 family members (E value = 2 × 10⁻⁸), but which was judged to be a homologue because it shows a conserved organization of functional domains, and its DNA-binding domain contains conserved residues at positions that are commonly mutated in human cancers (
A common problem that arose in the large-scale BLAST searches was that many human disease proteins belong to large families of closely related yet functionally distinct proteins, such as kinases, cell adhesion molecules, and certain classes of transcription factors. Inspection of BLAST scores and alignments was often insufficient to determine whether the most related protein detected in a given species represents a true homologue or merely a related protein. Drosophila possesses ~300 protein kinases and >100 homeodomain proteins, for example, and identification of many of these proteins as exact homologues to known proteins in other organisms is often not possible (
Limitations of the Analysis
In any large-scale computational survey of this type, it is inevitable that some inaccuracies and errors will arise. Although we performed extensive manual cross-checking of our results, our effort should be considered a first-pass survey of human disease-related loci in Drosophila, which will be corrected and refined as the completed fly genome sequence is further analyzed. Certain aspects of the sequence data would be expected to cause potential homologues to be missed. The genome sequence currently contains >1,000 relatively small gaps, some of which could contain disease gene homologues that we failed to detect. The Machado-Joseph disease gene (SCA3/MJD) has an identified homologue in C. elegans but not in D. melanogaster, for example, suggesting that a fly counterpart could exist but might have gone undetected in our survey. Furthermore, the gene prediction algorithms used to analyze the fly genome sequence are known to be error-prone, incorrectly predicting 5' and 3' coding exons and intron-exon boundaries, splitting single genes into two or more predicted genes, and merging adjacent genes into a single predicted transcript (
Results of the Survey
In our survey of 287 human disease genes, a total of 178 (62%) were found to have likely homologues in Drosophila. Inspection of the different classes of genes indicates that some categories are better represented than others (Table 1). Categories with a high representation of homologues in Drosophila include the genes for cancer (47 of 65, 72%), neurological diseases (38 of 59, 64%), malformation syndromes (22 of 34, 65%), metabolic diseases (14 of 17, 82%) and renal diseases (11 of 16, 69%). A small number of genes implicated in cardiac diseases all had likely homologues in Drosophila (6 of 6). Underrepresented in the fly genome were likely homologues of the genes implicated in endocrine (12 of 31, 39%) and hematological diseases (8 of 18, 44%) as well as diseases of the immune system (7 of 17, 41%).
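The percentages quoted above follow directly from the counts, as the short Python check below shows. Only the counts stated in the text are used; the nine categories listed here account for 263 of the 287 genes, so the remainder presumably falls in other categories of Table 1, which is an assumption on our part.

```python
# (homologues found, genes surveyed) per category, as quoted in the text.
counts = {
    "cancer": (47, 65),
    "neurological": (38, 59),
    "malformation syndromes": (22, 34),
    "metabolic": (14, 17),
    "renal": (11, 16),
    "cardiac": (6, 6),
    "endocrine": (12, 31),
    "hematological": (8, 18),
    "immune": (7, 17),
}


def percent(found, total):
    """Percentage of surveyed genes with a likely fly homologue, rounded."""
    return round(100 * found / total)


overall = percent(178, 287)
by_category = {name: percent(f, t) for name, (f, t) in counts.items()}

# List categories from best to worst represented in the fly genome.
for name, pct in sorted(by_category.items(), key=lambda kv: -kv[1]):
    found, total = counts[name]
    print(f"{name}: {found} of {total} ({pct}%)")
print(f"overall: 178 of 287 ({overall}%)")
```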
Cancer Genes in Drosophila
A high proportion of cancer genes have homologues in Drosophila. Many of these homologues have been identified by workers in the field who have cloned these genes using hybridization to mammalian sequences or more recently by searching for sequence similarity in databases of Drosophila genomic sequence or EST sequences. Our first look at the complete sequence of the genome allowed us to identify additional homologues and also to make tentative statements about classes of cancer genes that appear to be absent in Drosophila. Two groups of cancer genes that appear to be absent in Drosophila are the genes mutated in breast cancer (BRCA1 and BRCA2) and the genes mutated in Fanconi's anemia (FANCA, FANCC, and FANCG), a disease characterized by anemia, chromosomal instability and a predisposition to cancer. Also apparently absent are homologues of mdm2 and p19ARF (or p14ARF) which regulate the levels of the p53 protein in mammalian cells (
One of the genes identified as a result of the sequencing effort is a convincing homologue of the human menin gene (MEN1). Mutations in menin are found in the multiple endocrine neoplasia type 1 syndrome, a familial cancer syndrome characterized by varying combinations of tumors in the parathyroid glands, the pancreatic islets, the anterior pituitary, as well as a variety of other tissues. These tumors often secrete the hormones of the tissue of origin (e.g., insulin and growth hormone). The menin gene encodes a nuclear protein of 610 amino acids which is thought to bind to and inhibit the function of the JunD transcription factor in humans (
Also identified in our survey was a Drosophila p53 family member, the sequence and function of which have been described in two recent publications (
Mutations in the two related genes EXT1 and EXT2 have been implicated in human multiple exostosis syndromes, which are characterized by abnormal bony outgrowths. The Drosophila homologue of EXT1 is the tout-velu gene, which has been shown to be necessary for the diffusion of the Hedgehog protein that functions as a morphogen in many tissues (
A likely homologue of the STK11 kinase was also identified (Fig 3). This gene is mutated in Peutz-Jeghers disease (
Ataxia telangiectasia (ATM) is a syndrome characterized by loss of coordination (ataxia) as well as multiple cutaneous capillary malformations (telangiectasia). The gene product of the ATM locus has been implicated in activating signaling pathways in response to DNA damage. The Drosophila mei-41 gene has previously been shown to function as an ATM homologue in many respects (
Neurological Genes in Drosophila
Out of 59 human neurological genes surveyed, 38 appear to be conserved in Drosophila (
Also worth noting are the human neurological disease loci that we failed to detect in the fly genome. We were unable to find a fly counterpart of the human prion protein gene (PRNP), despite extensive BLAST searches using different prion protein segments, or homologues of the Charcot-Marie-Tooth syndrome 1A and 1B loci, or the Parkinson's disease gene encoding α-synuclein. Directed neuronal expression of human α-synuclein in aged transgenic flies leads to the formation of Lewy bodies and other morphological defects reminiscent of human Parkinson's disease (
α-synuclein itself may be absent.
Several genes implicated in expanded polyglutamine repeat diseases, including huntingtin, FRDA (Friedreich ataxia), SCA2 and SCA6 (spinocerebellar ataxia loci), are conserved in the Drosophila genome, although others, such as putative homologues of SCA1, SCA3/MJD (Machado-Joseph Disease) and SCA7, were not found. Transgenic Drosophila models of expanded polyglutamine repeat diseases have already been developed by directed expression of human Huntingtin and SCA3/MJD proteins and shown to reproduce the nuclear inclusions of expanded repeat proteins that are characteristic of this class of illness (
Malformation Syndrome and Metabolic Disorder Genes in Drosophila
Almost two thirds of the genes implicated in malformation syndromes have likely Drosophila homologues. This finding is not surprising, since many of these genes function in defining the body plan in the embryo and in patterning specific tissues. Indeed, some of these genes were originally cloned by virtue of their sequence similarity to patterning genes in Drosophila. These include the Sonic hedgehog gene (holoprosencephaly 3) and the Eyes absent 1 (EYA1) gene, which is mutated in Melnick-Fraser branchio-oto-renal dysplasia. Also frequently conserved were genes implicated in diseases caused by abnormalities in metabolism, presumably reflecting conservation of many metabolic pathways between Drosophila and humans. Surprisingly, a likely homologue of the gene implicated in Lesch-Nyhan syndrome, hypoxanthine guanine phosphoribosyl transferase (HPRT1), was not found in the fly genome sequence. C. elegans appears to have an HPRT homologue, and many other enzymes involved in purine biosynthesis in mammals have Drosophila homologues. A fly HPRT homologue might therefore be a likely candidate for a gene that lies in one of the small gaps still remaining in the assembled Drosophila genome sequence.
Hematological and Immune Disease Genes in Drosophila
Genes that function in the mammalian immune system and in mammalian blood cells are significantly underrepresented in Drosophila. Not surprisingly, genes that function in acquired immunity, such as RAG1 and RAG2, which are involved in immunoglobulin gene rearrangement, are not found in Drosophila. The absence of RAG-like proteins is consistent with the fact that the Drosophila genome is not known to undergo any programmed DNA rearrangements, in contrast to several lower organisms, such as yeast and bacteria, and to the mammalian immune system. Some of the genes that function in signaling pathways in hematopoietic cells, however, including BTK and JAK3, possess Drosophila homologues. Genes that function in oxygen transport via erythrocytes, such as the hemoglobin genes, and genes involved in blood coagulation do not have Drosophila homologues, reflecting fundamental differences in physiology between the two organisms. Flies do not possess a hemoglobin-based oxygen delivery system or a clotting system resembling those of human blood. Instead, they rely upon a branching tracheal system in which oxygen is delivered directly to tissues from the atmosphere, and their soft tissues are protected from injury by a durable exoskeleton. However, human hematological disease genes that encode components or likely regulators of the cytoskeleton, such as the genes for Wiskott-Aldrich syndrome (WAS) and hereditary spherocytosis (ANK1), do possess Drosophila homologues.
Endocrine and Renal Disease Genes in Drosophila
Of the genes implicated in endocrine disorders, many components of insulin-mediated signaling pathways are found in Drosophila. Most other endocrine pathways do not appear to be conserved. Finally, despite major dissimilarities between the vertebrate kidney and insect Malpighian tubules, a significant proportion of genes mutated in congenital renal disorders have Drosophila homologues. Many of these genes encode proteins involved in fluid and electrolyte transport and their identification might encourage the study of Malpighian tubule development and function to gain insight into certain human kidney disorders.
In conclusion, Drosophila appears to represent a particularly good model organism for the study of genes implicated in many cancers, neurological disease, malformation syndromes, metabolic disorders and some renal diseases. Specific endocrine, immunological and hematological disease genes may require vertebrate model organisms such as mice, since relatively few of the known human disease genes in these categories are present in Drosophila. Most promisingly, our search for fruit fly homologues of 287 known human disease genes leads us to conclude that as additional human disease genes are discovered, it is more likely than not that a counterpart will be found in the Drosophila genome.
Footnotes
1 Abbreviations used in this paper: ATM, ataxia telangiectasia; ATR, ATM-related; BDGP, Berkeley Drosophila Genome Project; OMIM, Online Mendelian Inheritance in Man database; VWF, von Willebrand factor; WAS, Wiskott-Aldrich syndrome.
Acknowledgements
We thank Celera Genomics and the Berkeley Drosophila Genome Project for their hospitality and generosity in providing access to the Drosophila genome sequence data, and Oxana K. Pickeral, Jiong Zhang, Peter M. Kuehl, Gena Heidary, and Glen A. Seidner for assistance with the protein database searches and sequence alignments.
M.E. Fortini is supported by grants from the National Institutes of Health (R01AG14583), the Alzheimer's Association, the Life and Health Insurance Medical Research Fund, the American Heart Association, and Merck & Co., Inc., and I.K. Hariharan is supported by grants from the National Institutes of Health (R01NS36084 and R01EY11632) and the American Cancer Society.
Submitted: 5 June 2000
Revised: 26 June 2000
Accepted: 26 June 2000