1 Department of Biology
2 Ricerche Interdipartimentale Biotecnologie Innovative (CRIBI), University of Padua, Padua I-35131, Italy
ABSTRACT
gene expression; protein network; Mendelian disease
INTRODUCTION
Data on gene expression in human tissues, derived from unbiased cDNA libraries, enabled in silico reconstruction of the genomic expression profiles of different human tissues (35) and showed that a small fraction of genes accounts for a large proportion of total transcriptional activity. Preliminary studies on adult skeletal muscle, heart, and retina indicated that Mendelian disease genes are relatively upregulated in affected tissues (35). High expression of a limited number of genes would produce a limited number of proteins in large amounts. If the "importance" or connectivity of a protein is assumed to be related to its abundance, then genes mutated in Mendelian diseases, which code for proteins with a pivotal role, should be highly expressed in disease cells and tissues.
We report here on quantitative analysis of the expression of 737 genes involved in Mendelian diseases compared with a set of 13,294 known human genes, as obtained from transcriptional profiles of 19 adult human tissues.
METHODS
An expression profile is a list of expressed genes, showing the abundance of their transcripts in a given cell or tissue. Each gene in a given profile is identified by gene name and description, UniGene cluster, LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink/) (19) number, if available, and GenBank ID of the longest sequence representative of the cluster. The number of expressed sequence tags (ESTs) per UniGene cluster included in the expression profile of a given tissue was used to estimate the level of activity of the corresponding gene. Tissue expression profiles were reconstructed from the pool of libraries pertaining to each tissue. Only tissues for which a sufficiently large number of ESTs from unbiased libraries was available were considered. The whole data set included 270,871 ESTs, corresponding to 27,924 UniGene "clusters," 14,131 of which corresponded to LocusLink entries and to "known human genes."
The distribution of the number of genes according to level of expression follows a power law. For the observed distribution of the number of genes with expression level x, $F(x) = a x^{-\gamma}$, we used nonlinear regression to estimate the parameters $a$ and $\gamma$ in each tissue. The estimates were then used to infer the parameters describing the connectivity distribution of a hypothetical network of protein-protein interactions among the proteins encoded by the considered genes, under the hypothesis that the expression level of a gene is proportional to the "importance" or connectivity of the corresponding protein.
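The regression step can be illustrated with a minimal sketch. It assumes the fitted function has the form F(x) = a·x^(-gamma), where gamma names the exponent, and it uses a plain Gauss-Newton iteration started from a log-log straight-line fit. The text does not say which nonlinear regression routine was used, so the function name `fit_power_law`, the algorithm, and the synthetic counts below are illustrative assumptions, not the authors' Perl implementation.

```python
import math


def fit_power_law(xs, counts, iterations=50):
    """Least-squares fit of counts ~ a * x**(-gamma) by Gauss-Newton.

    xs and counts must be positive; returns the pair (a, gamma).
    """
    # Starting values: ordinary linear regression of log(count) on log(x).
    log_x = [math.log(x) for x in xs]
    log_c = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_c = sum(log_c) / n
    slope = (sum((u - mean_x) * (v - mean_c) for u, v in zip(log_x, log_c))
             / sum((u - mean_x) ** 2 for u in log_x))
    gamma = -slope
    a = math.exp(mean_c - slope * mean_x)

    for _ in range(iterations):
        # Accumulate J^T J and J^T r for the two parameters (a, gamma).
        s_aa = s_ag = s_gg = g_a = g_g = 0.0
        for x, c, lx in zip(xs, counts, log_x):
            p = x ** (-gamma)
            r = c - a * p        # residual
            d_a = p              # dF/da
            d_g = -a * p * lx    # dF/dgamma
            s_aa += d_a * d_a
            s_ag += d_a * d_g
            s_gg += d_g * d_g
            g_a += d_a * r
            g_g += d_g * r
        det = s_aa * s_gg - s_ag * s_ag
        if abs(det) < 1e-12:
            break  # normal equations are singular; keep current estimate
        step_a = (s_gg * g_a - s_ag * g_g) / det
        step_g = (s_aa * g_g - s_ag * g_a) / det
        a += step_a
        gamma += step_g
        if abs(step_a) < 1e-10 and abs(step_g) < 1e-10:
            break
    return a, gamma


# Synthetic, noise-free counts (not data from the study).
levels = [1, 2, 3, 5, 8, 13]
n_genes = [3000.0 * x ** -1.7 for x in levels]
a_hat, gamma_hat = fit_power_law(levels, n_genes)
print(a_hat, gamma_hat)
```

Fitting in the original scale rather than only in log-log space weights the abundant low-expression classes more heavily; which weighting is preferable depends on the error structure of the counts.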
We further considered the disease loci of the OMIM Morbid Map (ftp://ftp.ncbi.nih.gov/repository/OMIM/morbidmap) and, using a Perl script that parses the files for the conversion of LocusLink/UniGene numbers and of LocusLink/MIM numbers, we established the exact correspondence between disease genes/loci and UniGene clusters represented in the reconstructed expression profiles. Genes involved in tumor development or in susceptibility or resistance to diseases, and those involved in "conditions" or in nondisease phenotypes, were excluded. This yielded a data set of 737 disease genes whose mutations cause Mendelian disorders. We compared their expression with that of LocusLink entries corresponding to 13,294 known human genes ("reference set") in 19 human tissues. Mendelian disease genes were excluded from the reference set.
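The mapping step amounts to joining two conversion tables. The sketch below shows the idea in Python with in-memory dictionaries. The identifiers are hypothetical placeholders, the function name `map_disease_genes` is an assumption, and the layout of the actual conversion files is not described in the text.

```python
def map_disease_genes(mim_to_locus, locus_to_unigene, profiled_clusters):
    """Return {MIM number: UniGene cluster} for disease loci that can be
    traced through LocusLink to a cluster present in the expression profiles."""
    mapped = {}
    for mim, locus in mim_to_locus.items():
        cluster = locus_to_unigene.get(locus)
        if cluster is not None and cluster in profiled_clusters:
            mapped[mim] = cluster
    return mapped


# Hypothetical placeholder identifiers, for illustration only.
mim_to_locus = {"MIM_A": "LL_1", "MIM_B": "LL_2", "MIM_C": "LL_3"}
locus_to_unigene = {"LL_1": "Hs.X1", "LL_2": "Hs.X2"}
profiled_clusters = {"Hs.X1", "Hs.X9"}

disease_map = map_disease_genes(mim_to_locus, locus_to_unigene, profiled_clusters)
print(disease_map)
```

A disease locus is dropped when it has no UniGene counterpart or when its cluster is absent from every reconstructed profile, which is one plausible reading of "exact correspondence."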
For each given gene, average level of expression was calculated over all tissues in which its activity was detected. The average level of expression of disease genes was compared with that of genes of the reference set.
Genes of the reference set and disease genes were classified according to their average level of expression across the considered tissues (averaged over the tissues in which the gene is expressed): high, more than 5 ESTs per 10,000; moderate, from 1 to 5 ESTs per 10,000; weak, less than 1 EST per 10,000. The frequency distributions of the two groups were compared by $\chi^2$ test. The same comparison was made separately for autosomal dominant and autosomal recessive genes against the reference set.
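The averaging and classification rules can be sketched as follows. The thresholds come from the text. The normalization of raw EST counts to ESTs per 10,000 and all function names are assumptions made for illustration.

```python
def ests_per_10k(est_count, tissue_total_ests):
    """Normalize a raw EST count to ESTs per 10,000 ESTs sampled in the tissue."""
    return 10000.0 * est_count / tissue_total_ests


def average_expression(level_by_tissue):
    """Mean expression over the tissues in which the gene is detected (level > 0)."""
    detected = [v for v in level_by_tissue.values() if v > 0]
    return sum(detected) / len(detected) if detected else 0.0


def expression_class(avg_level):
    """Classify an average level (ESTs per 10,000) as high, moderate, or weak."""
    if avg_level > 5:
        return "high"
    if avg_level >= 1:
        return "moderate"
    return "weak"


# Illustrative values only: the gene is not detected in the third tissue,
# so that tissue does not enter the average.
gene_levels = {"heart": 8.0, "retina": 4.0, "liver": 0.0}
avg = average_expression(gene_levels)
print(avg, expression_class(avg))
```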
The hypothesis of independence between the expression level of genes and the mode of inheritance of the corresponding diseases was tested by means of a $\chi^2$ test on a contingency table, after grouping genes corresponding to autosomal dominant or autosomal recessive diseases into the three expression classes described above.
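A test of independence on a table with two inheritance modes and three expression classes can be sketched as below. The counts are invented for illustration, the function names are assumptions, and every row and column total is assumed to be nonzero. The closed-form P value applies only to 2 degrees of freedom, which is what such a 2 × 3 table gives.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_two_dof(stat):
    """Upper-tail probability of a chi-square variable with exactly 2 degrees
    of freedom, for which the survival function is exp(-stat / 2)."""
    return math.exp(-stat / 2.0)


# Invented counts. Rows: autosomal dominant, autosomal recessive.
# Columns: high, moderate, weak expression.
counts = [[30, 50, 20],
          [20, 50, 30]]
stat, dof = chi_square_independence(counts)
print(stat, dof, p_value_two_dof(stat))
```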
The Perl software used to generate the data in this manuscript is freely available from the authors upon request.
RESULTS
It has been proposed that intracellular protein-to-protein interactions form a highly inhomogeneous scale-free network in which few highly connected proteins play a central role in interactions with numerous, less connected proteins (6, 12, 16). If, allowing for a certain degree of inaccuracy, the abundance of a given protein reflects the level of expression of the corresponding gene, then the distribution of protein-encoding genes according to their levels of expression should fit a scale-free network model. We observed that the distribution of the number of genes with expression level x, F(x), follows a power-law trend ($F(x) = a x^{-\gamma}$) in each of the 19 considered tissues (correlation coefficient of the fit, $r^2$: minimum 0.80, average 0.98, median 0.99).
The connectivity of a network, P(k), can be defined as the function describing the distribution of nodes having k connections with other nodes in the network. Under the assumption that the level of expression of a protein correlates with the number of interactions it establishes with other proteins, it can be supposed that $k \propto x$. We used the parameters characterizing the function of the number of genes per level of expression in each tissue, estimated by nonlinear regression, to infer the parameters describing the connectivity of the network, P(k). Across the 19 tissues considered in our study, the estimated parameters of the observed distribution, $\hat{\gamma}$ and $\hat{a}$, averaged 1.7 and 3,334.9, respectively. Estimated $\hat{\gamma}$ values ranged from 1.2 to 2.4, with the extreme values in melanocytes and in adipose tissue, respectively. The relative homogeneity of $\hat{\gamma}$ among the 19 different tissues suggests that genomic transcriptional profiles show common features, despite considerable biological differences among tissues.
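The step from the expression distribution to the connectivity distribution can be written out under the stated proportionality assumption. Here the fitted distribution is taken to be $F(x) = a x^{-\gamma}$, with $\gamma$ naming the exponent, and the constant $c > 0$ in $k = c\,x$ is introduced only for illustration; its value is not given in the text.

$$
P(k) \;\propto\; F\!\left(\frac{k}{c}\right) \;=\; a \left(\frac{k}{c}\right)^{-\gamma} \;=\; a\,c^{\gamma}\,k^{-\gamma}
$$

Under this assumption the exponent $\gamma$ carries over unchanged from F(x) to P(k), and only the prefactor depends on the unknown constant $c$. This is why a similar exponent across tissues can be read as a shared feature of the hypothetical networks.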
We analyzed the expression of 737 human genes whose mutations cause Mendelian disorders, in 19 human tissues, compared with that of 13,294 known human genes. We excluded from the analysis genes involved in nondisease phenotypes, psychiatric conditions, tumor development, or susceptibility or resistance to diseases. The Supplemental Table (available at http://telethon.bio.unipd.it/bioinfo/Disease_genes/, as well as at the Physiological Genomics web site; see footnote 1) reports expression data for the 737 selected disease genes in 19 human tissues. The average expression level of disease genes in tissues was significantly higher than the mean of the 13,294 genes of the reference set (1.220 vs. 0.574, t-test, P = 3.29E-8). Disease genes, classified according to their average level of expression in all the considered tissues, appeared significantly more expressed than expected (observed 16.6%, expected 3.2%; $\chi^2$ test, P = 2.76E-56, Table 2) when compared with the genes of the reference set. When genes involved in autosomal dominant diseases were compared with genes of the reference set, the difference was even more pronounced (observed 18.9%, expected 3.2%; P = 2.24E-49).
DISCUSSION
According to recent hypotheses, intracellular protein-to-protein interactions form a highly inhomogeneous scale-free network in which few highly connected proteins play a central role in interactions with numerous, less connected proteins (6, 12, 16).
Since the abundance of a given protein is, in general, strongly correlated with the level of expression of the corresponding gene (11), the characteristics of the protein-protein interaction network would be partially determined by the number of protein-encoding genes and by their respective levels of expression.
A very abundant protein, which likely corresponds to a highly transcribed gene, would potentially interact not only with other abundant proteins but also with rare or very rare polypeptides, thus becoming a relevant node in the network of protein-protein interactions. Statistical analysis of large-scale gene expression and protein interaction data showed that protein pairs encoded by coexpressed genes interact with each other more frequently than with random proteins and that the mean similarity of expression profiles is significantly higher for respective interacting protein pairs than for random ones (10). Moreover, Ge and colleagues (8) showed that pairs of genes encoding interacting proteins tend to be coexpressed and that gene expression data could be profitably used to refine models of protein interactions.
Highly expressed genes are expected to encode proteins representing vulnerable nodes, and their mutations are therefore expected to be pathogenic. Recently, it was shown that removal of highly connected nodes causes a noticeable fragmentation in gene networks (22). Our observation that genes involved in Mendelian diseases are mostly highly or moderately expressed and, in particular, more expressed than expected when compared with a large set of reference genes strongly suggests that disease genes encode highly connected nodes of the network.
The "scale-free network" model applied to genomic expression might explain the stability of many tissues over time. The structure of the protein-protein interaction network in a given cell or tissue is generated by the potential for interaction inherent in the structure of individual proteins and by their relative abundance. Upregulation of a limited number of genes would force a given tissue to maintain stable structural and functional characteristics over time, unless strong perturbations occur that affect the transcription levels of genes coding for functional nodes or create novel nodes by upregulation of silent genes. This would be the case for perturbations induced by the activity of genes encoding transcription factors, which regulate the expression of several additional genes. Defective mutations in regulatory genes are expected to produce complex and often lethal phenotypes, infrequently reported as Mendelian diseases.
According to the "scale-free network" model, errors or attacks involving the majority of nodes, which show small connectivity, should not significantly alter the path structure of the remaining nodes and should have little impact on the overall network topology (1), i.e., on function. Systematic mutagenesis in yeast provided evidence of a striking capacity to tolerate deletion of a substantial number of proteins from its proteome (16).
It is known that both physiological and developmental processes of eukaryotes display considerable robustness against the effects of mutations (26). Many loss-of-function mutations of developmental genes in higher organisms show no or weak phenotypic effect (9, 23, 27). It was shown in yeast that genes whose loss of function results in a weak or no-fitness effect are not more similar to their closest paralogues, in either sequence or temporal expression pattern (26). Thus gene duplications contribute little to mutational robustness on a genomic scale; robustness seems to depend mainly on the nature and organization of interactions between different proteins. When these concepts are transferred to the human genome, defective mutations in a relatively low number of genes encoding proteins corresponding to highly connected nodes seem to produce inherited disorders classified as Mendelian diseases. Therefore, defective mutations in genes encoding nodes with small connectivity are expected to have minor clinical consequences, selection against such variants is expected to be relatively mild, and genetic variability in these genes should be higher than in those directly involved in Mendelian diseases.
FOOTNOTES
Address for reprint requests and other correspondence: G. A. Danieli, Dept. of Biology, Univ. of Padua, via U. Bassi 58 B, Padua I-35131, Italy (E-mail: danieli{at}bio.unipd.it).
10.1152/physiolgenomics.00095.2003.
1 The Supplementary Material for this article (expression data of 737 selected disease genes in 19 human tissues) is available online at http://physiolgenomics.physiology.org/cgi/content/full/00095.2003/DC1.
REFERENCES