Disease genes and intracellular protein networks

S. Bortoluzzi1, C. Romualdi2, A. Bisognin1 and G. A. Danieli1

1 Department of Biology
2 Ricerche Interdipartimentale Biotecnologie Innovative (CRIBI), University of Padua, Padua I-35131, Italy


    ABSTRACT
 
Using a computational approach, we reconstructed genomic transcriptional profiles of 19 different adult human tissues, based on information on the activity of 27,924 genes obtained from unbiased UniGene cDNA libraries. In each tissue considered, a small number of genes were highly expressed or "tissue specific." The distribution of gene expression levels in a tissue appears to follow a power law, suggesting a correspondence between the transcriptional profile and the "scale-free" topology of protein networks. The expression of 737 genes involved in Mendelian diseases was analyzed and compared with that of a large reference set of known human genes. Disease genes were significantly more highly expressed than expected, which suggests that their products may correspond to important nodes of the intracellular protein network. Self-organization of the protein network, its stability over time in the differentiated state, and its relationship with the degree of genetic variability at the genome level are discussed.

gene expression; protein network; Mendelian disease


    INTRODUCTION
 
IN THE LAST FEW YEARS, scale-free network models have been proposed to describe the interactions underlying the organization of biological systems (16, 22). Scale-free networks are characterized by robustness against perturbation, such as random depletion of network components (17, 22). Attempts to reconstruct protein-protein interaction networks by systematic analysis of protein complexes (7, 13), by the two-hybrid approach (15), or by a combination of computational and experimental approaches (25) produced results supporting these models.

Data concerning the expression of genes in human tissues, derived from unbiased cDNA libraries, enabled in silico reconstruction of the genomic expression profiles of different human tissues (3–5) and showed that a small fraction of genes accounts for a large proportion of the total transcriptional activity. Preliminary studies on adult skeletal muscle, heart, and retina indicated that Mendelian disease genes are relatively upregulated in affected tissues (3–5). High expression of a limited number of genes would produce a limited number of proteins in large amounts. If the "importance" or connectivity of a protein is assumed to be related to its abundance, then genes mutated in Mendelian diseases, which code for proteins with a pivotal role, should be highly expressed in the affected cells and tissues.

We report here on quantitative analysis of the expression of 737 genes involved in Mendelian diseases compared with a set of 13,294 known human genes, as obtained from transcriptional profiles of 19 adult human tissues.


    METHODS
 
Data mining and processing from UniGene, the largest available collection of human expressed genes (http://www.ncbi.nlm.nih.gov/UniGene/) (2), were performed by a fully automated procedure for expression profile reconstruction based on Perl scripts. Only unbiased (unsubtracted and/or unnormalized) UniGene cDNA libraries pertaining to adult human tissues were considered.

An expression profile is a list of expressed genes, showing the abundance of their transcripts in a given cell or tissue. Each gene in a given profile is identified by gene name and description, UniGene cluster, LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink/) (19) number, if available, and GenBank ID of the longest sequence representative of the cluster. The number of expressed sequence tags (ESTs) per UniGene cluster included in the expression profile of a given tissue was used to estimate the level of activity of the corresponding gene. Tissue expression profiles were reconstructed by using information from the pool of libraries pertaining to the tissue. Only those tissues for which a sufficiently large number of ESTs from unbiased libraries was available were considered. The whole data set included 270,871 ESTs, corresponding to 27,924 UniGene "clusters," 14,131 of which corresponded to LocusLink entries and to "known human genes."
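The pooling and counting step described above can be sketched as follows. This is a minimal Python illustration, not the authors' Perl procedure; the input layout (one cluster ID per EST), the function name `build_profile`, and the cluster IDs are all assumptions made for the example. Scaling to ESTs per 10,000 follows the unit used later in this section.

```python
from collections import Counter

def build_profile(library_ests):
    """Pool EST-to-cluster assignments from several libraries of one tissue.

    library_ests: list of lists; each inner list holds one UniGene cluster ID
    per EST sequenced from that library (hypothetical input layout).
    Returns (counts, per_10k): raw EST counts per cluster, and the same counts
    scaled to ESTs per 10,000 ESTs of the pooled tissue sample.
    """
    counts = Counter()
    for ests in library_ests:
        counts.update(ests)
    total = sum(counts.values())
    per_10k = {cluster: 10000.0 * n / total for cluster, n in counts.items()}
    return counts, per_10k

# Toy example with invented cluster IDs (not real UniGene data).
heart_libraries = [
    ["Hs.1", "Hs.1", "Hs.2", "Hs.3"],
    ["Hs.1", "Hs.2", "Hs.4", "Hs.1"],
]
counts, per_10k = build_profile(heart_libraries)
```

Pooling before scaling means that large libraries weigh more than small ones, which is one possible reading of "pool of libraries"; other weightings would be equally easy to implement.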

The distribution of the number of genes according to level of expression follows a power law. For the observed distribution of the number of genes with expression level x, $F(x) = a x^{-\lambda}$, we used nonlinear regression to estimate the parameters a and λ in each tissue. These estimates were used to infer the parameters describing the connectivity distribution of a hypothetical network of protein-protein interactions among the proteins encoded by the considered genes, under the hypothesis that the expression level of a gene is proportional to the "importance" or connectivity of the corresponding protein.
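As a rough illustration of estimating a and λ for F(x) = a x^(-λ), the sketch below uses ordinary least squares on log F versus log x. This is a simpler stand-in for the nonlinear regression named in the text, not the same procedure, and the two can give somewhat different estimates on noisy data. The function name and the synthetic data are assumptions.

```python
import math

def fit_power_law(levels, gene_counts):
    """Estimate (a, lam) in F(x) = a * x**(-lam) by a straight-line fit
    of log F on log x. Points with non-positive values are skipped."""
    pts = [(math.log(x), math.log(f))
           for x, f in zip(levels, gene_counts) if x > 0 and f > 0]
    n = len(pts)
    mean_u = sum(u for u, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in pts)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in pts)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_u
    # log F = log a - lam * log x, so a = exp(intercept) and lam = -slope
    return math.exp(intercept), -slope

# Synthetic, noise-free data generated from a = 3000 and lam = 1.7
# (illustrative values only, not the published data).
levels = [1, 2, 3, 5, 8, 13, 21]
gene_counts = [3000.0 * x ** -1.7 for x in levels]
a_hat, lam_hat = fit_power_law(levels, gene_counts)
```

On noise-free input the fit recovers the generating parameters up to rounding error; with real EST counts the sparse high-expression tail would dominate the uncertainty.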

We further considered the disease loci of the OMIM Morbid Map (ftp://ftp.ncbi.nih.gov/repository/OMIM/morbidmap) and, using a Perl script analyzing the files for the conversion of LocusLink/UniGene numbers and of LocusLink/MIM numbers, we established the exact correspondence between disease genes/loci and UniGene clusters represented in the reconstructed expression profiles. Genes involved in tumor development, susceptibility or resistance to diseases, and those involved in "conditions" or in nondisease phenotypes were excluded. At the end, a data set of 737 disease genes was obtained, whose mutations cause Mendelian disorders. We compared their expression with that of LocusLink entries corresponding to 13,294 known human genes ("reference set"), in 19 human tissues. Mendelian disease genes were excluded from the reference set.
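The identifier conversion described above amounts to joining two lookup tables and keeping only the loci whose UniGene cluster appears in the reconstructed profiles. The following Python sketch shows that join; it is not the authors' Perl script, and the function name, the dictionary layout, and every identifier in the example are invented for illustration.

```python
def map_disease_genes(mim_to_locus, locus_to_unigene, profiled_clusters):
    """Return {MIM number: UniGene cluster} for disease loci that can be
    converted MIM -> LocusLink -> UniGene and are present in the profiles."""
    mapped = {}
    for mim, locus in mim_to_locus.items():
        cluster = locus_to_unigene.get(locus)
        if cluster is not None and cluster in profiled_clusters:
            mapped[mim] = cluster
    return mapped

# Invented identifiers, for illustration only.
mim_to_locus = {"100001": "L1", "100002": "L2", "100003": "L3"}
locus_to_unigene = {"L1": "Hs.10", "L2": "Hs.20"}
profiled = {"Hs.10", "Hs.30"}
disease_clusters = map_disease_genes(mim_to_locus, locus_to_unigene, profiled)
```

In this toy case only the first locus survives: the second maps to a cluster absent from the profiles, and the third has no UniGene cluster at all.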

For each given gene, average level of expression was calculated over all tissues in which its activity was detected. The average level of expression of disease genes was compared with that of genes of the reference set.

Genes of the reference set and disease genes were classified according to their average level of expression in all the considered tissues (average on tissues in which the gene is expressed; more than 5 ESTs per 10,000, high; from 1 to 5 ESTs per 10,000, moderate; less than 1 EST per 10,000, weak). Frequency distribution in the two groups was compared by χ² test. The same was done separately for comparing expression levels of autosomal dominant or autosomal recessive genes with those of the reference set.
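The averaging rule and the three expression classes can be written down directly. The sketch below is a minimal Python rendering of the thresholds stated above; the function names are hypothetical, and treating a value of exactly 5 as "moderate" and exactly 1 as "moderate" follows the wording "more than 5" and "from 1 to 5".

```python
def average_expression(per_tissue_levels):
    """Mean ESTs per 10,000 over the tissues in which the gene was detected
    (tissues with a level of 0 are ignored)."""
    detected = [v for v in per_tissue_levels if v > 0]
    return sum(detected) / len(detected) if detected else 0.0

def expression_class(avg_per_10k):
    """Classify an average level: >5 high, 1 to 5 moderate, <1 weak."""
    if avg_per_10k > 5:
        return "high"
    if avg_per_10k >= 1:
        return "moderate"
    return "weak"

# Invented per-tissue levels (ESTs per 10,000) for three hypothetical genes.
gene_a = expression_class(average_expression([0, 2.0, 10.0, 0]))
gene_b = expression_class(average_expression([1.0, 0, 0]))
gene_c = expression_class(average_expression([0.5, 0, 0.3]))
```

Averaging only over tissues where the gene is detected means that a gene active in a single tissue is not diluted by the tissues in which it is silent.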

The hypothesis of independence between expression level of genes and mode of inheritance of the corresponding diseases was tested by means of a χ² test on a contingency table, after grouping genes corresponding to autosomal dominant or autosomal recessive diseases into the three previously described ranks of expression.
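For a 2 × 3 table (two modes of inheritance by three ranks of expression) the Pearson χ² test of independence has 2 degrees of freedom, and for exactly 2 degrees of freedom the upper-tail probability has the closed form exp(-χ²/2). The sketch below relies on that fact to avoid any external library. The counts are invented and are not the data of Table 2; the function names are assumptions.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def p_value_two_dof(stat):
    """Upper-tail probability, valid only for 2 degrees of freedom."""
    return math.exp(-stat / 2.0)

# Invented counts: rows = dominant, recessive; columns = high, moderate, weak.
table = [[30, 50, 20], [20, 60, 40]]
stat, dof = chi_square_independence(table)
p = p_value_two_dof(stat)
```

For tables of any other shape the closed form does not apply, and a general χ² distribution function from a statistics library would be needed.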

Perl software used to generate data in this manuscript is freely available upon request to the authors.


    RESULTS
 
By a computational approach described elsewhere (3) and implemented in novel software (21), we reconstructed genomic transcriptional profiles of 19 different adult human tissues, based on information on activity of 27,924 genes, obtained from unbiased cDNA libraries (Table 1; http://telethon.bio.unipd.it/bioinfo/HGXP/). In each considered tissue, only a small proportion of genes appears highly expressed. An even smaller proportion of genes appears as tissue specific.


Table 1. Expression profiles of adult human tissues

 
The number of ESTs obtained for each given transcript from a specific tissue and derived from pools of unbiased cDNA libraries was used to estimate the expression level of the corresponding gene (3, 20, 21). In every considered tissue, a relatively small number of highly expressed genes appeared to account for a very large fraction of the total transcriptional activity, as shown in Fig. 1 and reported in Table 1 (columns 5 and 6). In general, slightly more than 10% of the total number of expressed genes accounts for one-half of the total transcriptional activity of any given tissue.
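The statement that a small share of genes accounts for one-half of total transcriptional activity can be checked on any profile by sorting genes by EST count and accumulating from the top. The sketch below shows that calculation on invented counts; the function name and the `share` parameter are assumptions.

```python
def fraction_of_genes_for_share(est_counts, share=0.5):
    """Smallest fraction of genes, taken from the most expressed downward,
    whose ESTs add up to at least `share` of all ESTs in the profile."""
    ordered = sorted(est_counts, reverse=True)
    target = share * sum(ordered)
    running = 0
    for rank, n in enumerate(ordered, start=1):
        running += n
        if running >= target:
            return rank / len(ordered)
    return 1.0

# Invented EST counts for ten hypothetical genes (total = 100 ESTs).
toy_counts = [40, 20, 10, 5, 5, 4, 4, 4, 4, 4]
frac = fraction_of_genes_for_share(toy_counts)
```

In this toy profile the two most expressed genes out of ten already cover half of the ESTs.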



Fig. 1. Gene expression profiles of five adult human tissues: plot of the log of the number of genes against the log of the number of expressed sequence tags (ESTs) per gene in each considered tissue. The plots pertain to five selected tissues for which considerably large EST samples were available.

 
Genes which appeared differentially expressed in a specific tissue, below the selected significance threshold level, are presumably "tissue specific" (Table 1, columns 7 and 8, α = 0.01 and α = 0.001, respectively). Such genes account for less than 1% of genes expressed in a given tissue.

It has been proposed that intracellular protein-to-protein interactions form a highly inhomogeneous scale-free network in which a few highly connected proteins play a central role in interactions with numerous, less connected proteins (6, 12, 16). If, to a first approximation, the abundance of a given protein reflects the level of expression of the corresponding gene, then we should expect the distribution of protein-encoding genes according to their levels of expression to fit a scale-free network model. We observed that the distribution of the number of genes with expression level x, F(x), follows a power-law trend ($F(x) = a x^{-\lambda}$) in each of the 19 tissues considered (minimum r² of the fit 0.80, average r² 0.98, median r² 0.99).

Connectivity of a network, P(k), can be defined as the function describing the distribution of nodes having k connections with other nodes in the network. Under the assumption that the level of expression of a protein correlates with the number of interactions it establishes with other proteins, it can be supposed that k ∝ x. We used the parameters characterizing the function of the number of genes per level of expression in each tissue, estimated by nonlinear regression, to infer the parameters describing the connectivity of the network, P(k). Over the 19 tissues considered in our study, the estimated parameters of the observed distribution, λ̂ and â, were on average 1.7 and 3,334.9, respectively. Estimated λ̂ values ranged from 1.2 to 2.4, with the extreme values in melanocytes and in adipose tissue, respectively. The relative homogeneity of the estimated λ̂ among the 19 different tissues suggests that genomic transcriptional profiles show common features, despite considerable biological differences among tissues.
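The step from the expression distribution to the connectivity distribution can be made explicit. The short derivation below assumes strict proportionality, k = c x with a constant c > 0; the constant c is introduced here for illustration only and does not appear in the original analysis.

```latex
% Assumption: k = c x with c > 0 constant, so x = k / c.
\begin{align}
F(x) &= a\,x^{-\lambda}, \\
P(k) &= F\!\left(\frac{k}{c}\right)
      = a\left(\frac{k}{c}\right)^{-\lambda}
      = \left(a\,c^{\lambda}\right) k^{-\lambda}
      \;\propto\; k^{-\lambda}.
\end{align}
```

Under this assumption the proportionality constant only rescales the prefactor, so the exponent estimated from the expression data carries over unchanged to the hypothetical connectivity distribution.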

We analyzed the expression of 737 human genes whose mutations cause Mendelian disorders, in 19 human tissues, compared with that of 13,294 known human genes. We excluded from the analysis genes involved in nondisease phenotypes, psychiatric conditions, tumor development, or susceptibility or resistance to diseases. The Supplemental Table (available at http://telethon.bio.unipd.it/bioinfo/Disease_genes/, as well as at the Physiological Genomics web site; see footnote 1) reports expression data of the 737 selected disease genes in 19 human tissues. The average expression level of disease genes in tissues was significantly higher than the mean of the 13,294 genes of the reference set (1.220 vs. 0.574; t-test, P = 3.29E-8). Disease genes, classified according to their average level of expression in all the considered tissues, appeared significantly more expressed than expected (observed 16.6%, expected 3.2%; χ² test, P = 2.76E-56, Table 2) when compared with the genes of the reference set. When the expression of genes involved in autosomal dominant diseases was compared with that of genes of the reference set, the difference was even more marked (observed 18.9%, expected 3.2%; P = 2.24E-49).
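The text reports a t-test on mean expression but does not say which variant was used. As one possibility, the sketch below computes the Welch (unequal-variance) t statistic and its approximate degrees of freedom; choosing Welch is an assumption, and the two samples are invented values, not the published data.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of
    freedom. Each sample needs at least two values."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na   # squared standard error of mean A
    vb = variance(sample_b) / nb   # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof

# Invented average expression levels (ESTs per 10,000).
disease = [2.0, 1.0, 3.0, 2.0]
reference = [1.0, 0.0, 1.0, 2.0]
t_stat, dof = welch_t(disease, reference)
```

Turning the statistic into a P value requires the t distribution, which the Python standard library does not provide, so that step is left out here.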


Table 2. Disease genes expression compared with the reference set of genes

 
The hypothesis of independence between mode of inheritance of Mendelian diseases and level of expression of the involved genes was tested by means of a χ² test on a contingency table, after grouping genes corresponding to autosomal dominant or autosomal recessive diseases into three different ranks of gene expression. The hypothesis of independence was rejected (P = 0.00802), indicating that genes corresponding to autosomal dominant diseases tend to be highly expressed, whereas those corresponding to autosomal recessive diseases tend to be moderately expressed, although, in general, genes involved in Mendelian diseases appear to be more expressed than expected.


    DISCUSSION
 
According to the data obtained in the present study, the fraction of genes accounting for the observed structural and functional differences among human tissues is probably small. A number of housekeeping genes must be ubiquitously active, and some genes might be selectively overexpressed in contaminating tissues (e.g., vessels, blood, connective tissue). In addition, some UniGene clusters could contain sequences derived from tissue-specific isoforms of transcripts belonging to the same gene. Therefore, the proportion of "tissue-specific" genes might be even smaller than observed. This view is in agreement with results obtained by multiple-tag experiments, indicating that the differentiated state is characterized by high expression of a limited number of genes that are probably functionally related and possibly coregulated (14, 18, 24).

According to recent hypotheses, intracellular protein-to-protein interactions form a highly inhomogeneous scale-free network in which few highly connected proteins play a central role in interactions with numerous, less connected proteins (6, 12, 16).

Since the abundance of a given protein is, in general, strongly correlated with the level of expression of the corresponding gene (11), the characteristics of the protein-protein interaction network would be partially determined by the number of protein-encoding genes and by their respective levels of expression.

A very abundant protein, which likely corresponds to a highly transcribed gene, would potentially interact not only with other abundant proteins but also with rare or very rare polypeptides, thus becoming a relevant node in the network of protein-protein interactions. Statistical analysis of large-scale gene expression and protein interaction data showed that protein pairs encoded by coexpressed genes interact with each other more frequently than with random proteins and that the mean similarity of expression profiles is significantly higher for respective interacting protein pairs than for random ones (10). Moreover, Ge and colleagues (8) showed that pairs of genes encoding interacting proteins tend to be coexpressed and that gene expression data could be profitably used to refine models of protein interactions.

Highly expressed genes are expected to encode proteins representing vulnerable nodes, and therefore their mutations are expected to be pathogenic. Recently, it was shown that removal of highly connected nodes causes a noticeable fragmentation in gene networks (22). Our observation that genes involved in Mendelian diseases are mostly highly or moderately expressed and, in particular, more expressed than expected when compared with a large set of reference genes strongly suggests that disease genes encode highly connected nodes of the network.

The "scale-free network" model applied to genomic expression might explain the stability of many tissues over time. The structure of the protein-protein interaction network in a given cell or tissue is generated by the potential for interactions inherent in the structure of individual proteins and by their relative abundance. Upregulation of a limited number of genes would force a given tissue to maintain stable structural and functional characteristics over time, unless strong perturbations occur that affect the transcription levels of genes coding for functional nodes or create novel nodes by upregulation of silent genes. This would be the case for perturbations induced by the activity of genes encoding transcription factors, which regulate the expression of several additional genes. Defective mutations in regulatory genes are expected to produce complex and often lethal phenotypes, which are infrequently reported as Mendelian diseases.

According to the "scale-free network" model, errors or attacks involving the majority of nodes, which show small connectivity, should not significantly alter the path structure of the remaining nodes and should have little impact on the overall network topology (1), i.e., on function. Systematic mutagenesis in yeast provided evidence of a striking capacity to tolerate deletion of a substantial number of proteins from its proteome (16).

It is known that both physiological and developmental processes of eukaryotes display considerable robustness against the effects of mutations (26). Many loss-of-function mutations of developmental genes in higher organisms show no or weak phenotypic effect (9, 23, 27). It was shown in yeast that genes whose loss of function has a weak or no fitness effect are not more similar to their closest paralogues, in either sequence or temporal expression pattern (26). Thus gene duplications contribute little to mutational robustness on a genomic scale; robustness seems instead to depend mainly on the nature and organization of interactions between different proteins. Transferring these concepts to the human genome, defective mutations in a relatively low number of genes encoding proteins corresponding to highly connected nodes seem to produce inherited disorders classified as Mendelian diseases. Therefore, defective mutations in genes encoding nodes with small connectivity are expected to have minor clinical consequences, selection against such variants is expected to be relatively mild, and genetic variability in these genes is expected to be higher than in those directly involved in Mendelian diseases.


    GRANTS
 
The financial support of Italian Ministry for Scientific and Technologic Research and of the University of Padua to G. A. Danieli and of the Italian Association for Cancer Research to S. Bortoluzzi is gratefully acknowledged.


    FOOTNOTES
 
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: G. A. Danieli, Dept. of Biology, Univ. of Padua, via U. Bassi 58 B, Padua I-35131, Italy (E-mail: danieli@bio.unipd.it).

10.1152/physiolgenomics.00095.2003.

1 The Supplementary Material for this article (expression data of 737 selected disease genes in 19 human tissues) is available online at http://physiolgenomics.physiology.org/cgi/content/full/00095.2003/DC1.


    REFERENCES
 

  1. Albert R, Jeong H, and Barabasi AL. Error and attack tolerance of complex networks. Nature 406: 378–382, 2000.
  2. Boguski MS and Schuler GD. ESTablishing a human transcript map. Nat Genet 10: 369–371, 1995.
  3. Bortoluzzi S, d’Alessi F, and Danieli GA. A computational reconstruction of the adult human heart transcriptional profile. J Mol Cell Cardiol 32: 1931–1938, 2000.
  4. Bortoluzzi S, d’Alessi F, and Danieli GA. A novel resource for the study of genes expressed in the adult human retina. Invest Ophthalmol Vis Sci 41: 3305–3308, 2000.
  5. Bortoluzzi S, d’Alessi F, Romualdi C, and Danieli GA. The human adult skeletal muscle transcriptional profile reconstructed by a novel computational approach. Genome Res 10: 344–349, 2000.
  6. Eisenberg D, Marcotte EM, Xenarios I, and Yeates TO. Protein function in the post-genomic era. Nature 405: 823–826, 2000.
  7. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, and Superti-Furga G. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415: 141–147, 2002.
  8. Ge H, Liu Z, Church GM, and Vidal M. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet 29: 482–486, 2001.
  9. Gonzalez-Gaitan M, Rothe M, Wimmer EA, Taubert H, and Jackle H. Redundant functions of the genes knirps and knirps-related for the establishment of anterior Drosophila head structures. Proc Natl Acad Sci USA 91: 8567–8571, 1994.
  10. Grigoriev A. A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res 29: 3513–3519, 2001.
  11. Gygi SP, Rochon Y, Franza BR, and Aebersold R. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 19: 1720–1730, 1999.
  12. Hartwell LH, Hopfield JJ, Leibler S, and Murray AW. From molecular to modular cell biology. Nature Suppl 402: C47–C52, 1999.
  13. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, and Tyers M. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180–183, 2002.
  14. Hsiao LL, Dangond F, Yoshida T, Hong R, Jensen RV, Misra J, Dillon W, Lee KF, Clark KE, Haverty P, Weng Z, Mutter GL, Frosch MP, Macdonald ME, Milford EL, Crum CP, Bueno R, Pratt RE, Mahadevappa M, Warrington JA, Stephanopoulos G, Stephanopoulos G, and Gullans SR. A compendium of gene expression in normal human tissues. Physiol Genomics 7: 97–104, 2001. First published October 2, 2001; 10.1152/physiolgenomics.00040.2001.
  15. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, and Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98: 4569–4574, 2001.
  16. Jeong H, Mason SP, Barabasi AL, and Oltvai ZN. Lethality and centrality in protein networks. Nature 411: 41–42, 2001.
  17. Jeong H, Tombor B, Albert R, Oltvai ZN, and Barabasi AL. The large-scale organization of metabolic networks. Nature 407: 651–654, 2000.
  18. Liang S, Fuhrman S, and Somogyi R. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput 3: 18–29, 1998.
  19. Pruitt KD and Maglott DR. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 29: 137–140, 2001.
  20. Romualdi C, Bortoluzzi S, and Danieli GA. Detecting differentially expressed genes in multiple tag sampling experiments: comparative evaluation of statistical tests. Hum Mol Genet 10: 2133–2141, 2001.
  21. Romualdi C, Bortoluzzi S, d’Alessi F, and Danieli GA. IDEG6: a web tool for detection of differentially expressed genes in multiple tag sampling experiments. Physiol Genomics 12: 159–162, 2003. First published November 12, 2002; 10.1152/physiolgenomics.00096.2002.
  22. Rung J, Schlitt T, Brazma A, Freivalds K, and Vilo J. Building and analysing genome-wide gene disruption networks. Bioinformatics Suppl 2: S202–S210, 2002.
  23. Saga Y, Yagi T, Ikawa Y, Sakakura T, and Aizawa S. Mice develop normally without tenascin. Genes Dev 6: 1821–1831, 1992.
  24. Szallasi Z. Genetic network analysis in light of massively parallel biological data acquisition. Pac Symp Biocomput 4: 5–16, 1999.
  25. Tong AH, Drees B, Nardelli G, Bader GD, Brannetti B, Castagnoli L, Evangelista M, Ferracuti S, Nelson B, Paoluzi S, Quondam M, Zucconi A, Hogue CW, Fields S, Boone C, and Cesareni G. A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 295: 321–324, 2002.
  26. Wagner A. Robustness against mutations in genetic networks of yeast. Nat Genet 24: 355–361, 2000.
  27. Wang YK, Schnegelsberg PNJ, Dausman J, and Jaenisch R. Functional redundancy of the muscle-specific transcription factors myf5 and myogenin. Nature 379: 823–825, 1996.