1Institute for Bioinformatics, GSF National Research Center for Environment and Health, Ingolstädter Landstrasse 1, D-85764 Neuherberg, 2Biomax Informatics AG, Lochhamer Strasse 9, D-82152 Martinsried and 3Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, D-85350 Freising, Germany
4 To whom correspondence should be addressed at the Technische Universität München. E-mail: d.frishman{at}wzw.tum.de
Abstract |
---|
Keywords: aggregation/designability/disease/duplication
Introduction |
---|
Previously, disease genes were found to be more highly expressed (Bortoluzzi et al., 2003), longer (Smith and Eyre-Walker, 2003), more tissue-specific (Smith and Eyre-Walker, 2003; Winter et al., 2004), to have more synonymous nucleotide substitutions (Huang et al., 2004) and to have fewer members amongst slowly evolving housekeeping genes (Winter et al., 2004). Comparisons conducted by López-Bigas and Ouzounis (2004) at the protein level revealed that disease proteins tend to be longer, more conserved, phylogenetically more extended and have fewer highly conserved paralogs than the average human protein. They subsequently exploited these differences to create a decision tree-based predictor of disease proteins.
A number of developments have allowed us to add to these studies. First, an intriguing hypothesis had been proposed that proteins with more designable structures (i.e. structures that are encoded by more sequences) are more robust to mutation and thermal stress (Li et al., 1996). In line with this hypothesis is the finding that proteins from a random sample of thermophiles exhibited a higher contact trace, a measure which correlates well with designability, than those from a sample of mesophiles (England et al., 2003). We hypothesized that, since proteins can be functionally impaired by mutations or environmental stresses, disease proteins would be less designable than non-disease proteins. A second development involved the discovery of sequence properties associated with protein aggregation (Chiti et al., 2003; DuBay et al., 2004). Many diseases have been associated with protein aggregation (Dobson, 2004; Ross and Poirier, 2004), but the extent of this phenomenon had not been assessed. One could therefore test whether disease proteins are more aggregation prone than non-disease proteins in terms of these properties.
In this work, we compared disease and non-disease proteins from the Ensembl human database (Birney et al., 2004) in terms of designability and aggregation propensity. In addition, we assessed, based on the current level of annotation, the likelihood that proteins with high sequence similarity to disease proteins are themselves associated with disease. We validated our findings using a differently annotated database of human proteins provided by Biomax Informatics, which contains roughly twice the number of proteins annotated with disease.
Materials and methods |
---|
A total of 34 111 proteins predicted to be encoded in the human genome were obtained from the Ensembl human v23.34e.1 database (Birney et al., 2004). We term this dataset the Ensembl protein dataset. OMIM-based (Hamosh et al., 2005) disease annotations for human genes were obtained using the Ensmart tool (Hammond and Birney, 2004) and mapped to 2113 proteins in the Ensembl protein dataset. OMIM is a database focused on heritable genetic diseases, most of which are of high penetrance. A larger set of 39 801 human proteins was obtained from the Biomax Human Genome Database (BHGDB), a product of Biomax Informatics (http://www.biomax.de). In all, 4352 proteins from the BHGDB were manually annotated with disease information.
High-quality disease proteins
We consider all proteins with any disease-related annotation as disease proteins for our analysis. This annotation varies from suggestions of disease susceptibility effects upon mutation to disease–gene associations identified by positional cloning or by multiple methods. Disease-causing mutations in OMIM, however, are marked with a number in parentheses indicating whether the mutation was positioned by mapping the wild-type gene (1), by mapping the disease phenotype itself (2) or by both approaches (3) (see http://www.ncbi.nlm.nih.gov/Omim/omimfaq.html). To check whether our results differed when higher quality data were used, a subset of 1470 proteins was obtained from the 2113 disease proteins in the Ensembl protein dataset by excluding proteins that were not associated with a disease marked with a (3) in the OMIM annotation. We term this the high-quality disease protein set.
Proteins associated with disease caused by amino acid substitution
Proteins associated with disease caused by amino acid substitution (DPAA) were first screened for by scanning the text of OMIM entries associated with proteins found in the Ensembl and Biomax datasets using a Perl script. Only OMIM entries associated with a single protein in each dataset were included, in order to avoid mapping amino acid substitutions to the wrong protein. Potential DPAAs were identified in the OMIM text by amino acid substitutions matching the pattern ANA, where A is a letter or group of letters representing an amino acid and N is a residue number, with no possibility of ANA representing a nucleotide substitution. For example, L234E would be considered an amino acid substitution, whereas A23C would be ignored for this study, as A and C could potentially represent nucleotides. The list of potential DPAAs was then refined manually.
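The ANA screen described above can be sketched as follows. This is a minimal illustration in Python, not the Perl script used in the study: the function name `find_substitutions`, the regular expression and the rule that a match is discarded only when both letters belong to A, C, G or T are assumptions made for the example.

```python
import re

# One-letter and three-letter amino acid codes (standard 20 residues).
ONE_LETTER = set("ACDEFGHIKLMNPQRSTVWY")
THREE_LETTER = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}
# Assumption: a match is ambiguous only if BOTH letters could be nucleotides.
NUCLEOTIDES = set("ACGT")

# ANA pattern: amino acid code, residue number, amino acid code.
PATTERN = re.compile(r"\b([A-Za-z]{1,3})(\d+)([A-Za-z]{1,3})\b")


def is_amino_acid(token):
    """Return True if token is a one-letter or three-letter amino acid code."""
    if len(token) == 1:
        return token in ONE_LETTER
    return token.upper() in THREE_LETTER


def find_substitutions(text):
    """Return (wild_type, position, mutant) tuples found in free text."""
    hits = []
    for wild_type, position, mutant in PATTERN.findall(text):
        if not (is_amino_acid(wild_type) and is_amino_acid(mutant)):
            continue
        # Skip matches such as A23C that may denote nucleotide changes.
        if wild_type in NUCLEOTIDES and mutant in NUCLEOTIDES:
            continue
        hits.append((wild_type, int(position), mutant))
    return hits


example = "Variants L234E, A23C and LEU12PRO were reported."
print(find_substitutions(example))
```

Matches found this way would only be candidates; as in the study, a manual review step would still be needed.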
Protein properties
In all cases in which disease and non-disease proteins were compared, only the largest protein encoded by each gene was included, as done by López-Bigas and Ouzounis (2004). Protein length, pI and SCOP (Andreeva et al., 2004) assignments were obtained from the PEDANT system (Riley et al., 2005). SCOP folds were assigned to proteins if the corresponding sequences were within a BlastP (Altschul et al., 1997) E-value of 10⁻⁶. The residues A, C, F, G, I, L, M, P, V, W and Y were considered to be hydrophobic and H, Q, N, S, T, K, R, D and E were considered hydrophilic in this study.
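As an illustration of how the hydrophobic and hydrophilic residue classes defined above could be used to find stretches of alternating hydrophobic and hydrophilic residues, the following Python sketch reports maximal alternating runs of at least five residues. The function names and the choice to let any other character (for example X) end a run are assumptions made for the example, not details taken from the study.

```python
# Residue classes as defined in this study.
HYDROPHOBIC = set("ACFGILMPVWY")
HYDROPHILIC = set("HQNSTKRDE")


def residue_class(residue):
    """Return 'o' for hydrophobic, 'i' for hydrophilic, None otherwise."""
    if residue in HYDROPHOBIC:
        return "o"
    if residue in HYDROPHILIC:
        return "i"
    return None


def alternating_stretches(sequence, min_len=5):
    """Return (start, stretch) for each maximal alternating run >= min_len."""
    stretches = []
    i = 0
    n = len(sequence)
    while i < n:
        previous = residue_class(sequence[i])
        if previous is None:
            i += 1  # unclassified residue: cannot start a run
            continue
        j = i + 1
        while j < n:
            current = residue_class(sequence[j])
            if current is None or current == previous:
                break  # alternation is broken here
            previous = current
            j += 1
        if j - i >= min_len:
            stretches.append((i, sequence[i:j]))
        i = j  # the next run can only start where this one ended
    return stretches


print(alternating_stretches("AKAKAKK"))
```

Counting the returned stretches per protein gives one simple per-sequence measure that can be compared between disease and non-disease sets.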
Designability
Protein designability was measured by counting the number of families in each fold contained in a given protein and taking the minimum. For example, if protein A contains three domains with folds F1, F2 and F3 and these folds in turn contain eight, three and seven families, respectively, protein A's minimum family count would be three. By recording the minimum family count of the folds in proteins, we assessed their designability by computing the designability of their least designable fold.
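The minimum family count can be written down in a few lines. The Python sketch below reproduces the worked example above; the function name, the dictionary of family counts per fold and the decision to return `None` for proteins without any assigned fold are illustrative assumptions.

```python
def min_family_count(protein_folds, families_per_fold):
    """Return the smallest family count among the folds of a protein.

    protein_folds: fold identifiers assigned to the protein's domains.
    families_per_fold: mapping from fold identifier to number of families.
    Returns None if the protein has no fold with a known family count.
    """
    counts = [families_per_fold[fold]
              for fold in protein_folds
              if fold in families_per_fold]
    return min(counts) if counts else None


# Worked example from the text: folds F1, F2 and F3 contain
# eight, three and seven families, respectively.
families = {"F1": 8, "F2": 3, "F3": 7}
print(min_family_count(["F1", "F2", "F3"], families))  # prints 3
```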
Results and discussion |
---|
Mutation or environmental change can disrupt function and/or create aberrant function in proteins. We hypothesized that disease proteins more often contain structures which are susceptible to perturbation by mutation or external stresses. Structures which are more designable tend to be more robust against mutation and thermal fluctuations (Zhang, 1997; Wingreen et al., 2004). Because a direct relationship exists between the number of sequences and the number of families in protein folds (Zhang et al., 1997), one can measure a protein fold's designability by counting the number of families in that fold. To provide an estimate of a protein's structural susceptibility to perturbing influences, we counted the number of families in each SCOP domain fold (Andreeva et al., 2004) contained in the protein and recorded the minimum count (see Materials and methods). In other words, we assessed the designability of proteins by computing the designability of their least designable fold. We reason that if a domain which occupies a portion of a protein is destabilized and becomes misfolded owing to stress or mutation (within or outside the domain), it is likely that the function of the entire protein would be affected. It was therefore natural to assess designability in this way.
Analysis of our human datasets revealed that disease proteins tend to have significantly smaller minimum family counts than the average human protein. Disease proteins have a noticeably larger proportion of folds containing only one family than non-disease proteins (Figure 1). A similar trend was observed when SCOP superfamilies were counted instead of families (data not shown). Overall, these results were independent of the length of the proteins examined (see Supplementary material S1). Taken together, our results on designability and disease suggest that disease proteins tend to be intrinsically less robust to mutation or external stresses than the average human protein.
Disease proteins are more likely to aggregate
A possible consequence of structural perturbation by mutation or environmental change is the misfolding or unfolding of proteins, leading to aggregate formation. Aggregates or their precursors are cytotoxic and can cause cell death (Bucciantini et al., 2002, 2004). In comparing disease and non-disease proteins, we find that the former tend to have isoelectric points closer to neutrality and more stretches of alternating hydrophobic–hydrophilic residues (of length 5 or more) than the latter (Figure 2A and B). Such properties have been shown to increase the aggregation rates of unfolded proteins in in vitro experiments (Chiti et al., 2003; DuBay et al., 2004). These results suggest that disease proteins tend to be more aggregation prone than non-disease proteins and complement work by Dima and Thirumalai (2004), which suggested that low sequence correlation entropies and mixed charged-hydrophobic and charged-polar runs in proteins may be indicative of disease association and a tendency to aggregate.
There are reasons to believe that sequence similarity to known disease proteins may also be a significant factor contributing to disease propensity. Protein length, designability, isoelectric point and sequence stretch patterns have so far been implicated in disease propensity. All of these properties depend on the sequences of the corresponding proteins. In addition, proteins with high sequence similarity are likely to share interacting partners (Yu et al., 2004), which may serve to link disease proteins functionally to non-disease proteins. Such functional linkage may indicate that both disease and non-disease proteins share a function that, when disrupted, would cause disease. Alternatively, disrupting the function of a non-disease protein functionally linked to a disease protein may subsequently disrupt the function of the disease protein and cause disease. Moreover, if the DNA sequences which encode proteins are sufficiently similar, non-allelic homologous recombination, a mechanism associated with disease (Bailey et al., 2002; Shaw and Lupski, 2004), may also occur. Hence one may hypothesize that annotating non-disease proteins with high sequence similarity to disease proteins as disease proteins would be valid for many proteins. To test this, the proportions of disease proteins in the Ensembl and Biomax human databases with duplicates annotated with disease were assessed (Table II). Almost 40% of the disease proteins in the Biomax database (and 50% in the Ensembl database) have duplicates (paralogs) associated with disease. Over one-fifth of disease proteins have all duplicates associated with disease in both databases. Disease proteins represent 5 and 9% of proteins in the Ensembl and Biomax datasets, respectively (Table I). In contrast, the chance that a duplicate of a disease protein is also a disease protein is significantly higher than expected (χ² test: P < 0.01) at 20 and 29% in the Ensembl and Biomax datasets, respectively (Table II). These findings strongly suggest that assigning disease status to non-disease proteins based on high sequence similarity is valid for many proteins.
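To illustrate the kind of comparison reported above, the following Python sketch runs a one-degree-of-freedom χ² goodness-of-fit test of an observed proportion against a background rate. The counts are hypothetical (20 disease proteins among 100 duplicates, tested against a 5% background) and are chosen only to mirror the reported percentages; the study's actual counts and exact test construction are not given in the text, and the function name is an assumption.

```python
import math


def chi_square_one_proportion(observed_hits, total, expected_rate):
    """Chi-square goodness-of-fit test (1 degree of freedom).

    Compares the observed number of hits among `total` items with the
    number expected under `expected_rate`. Returns (chi2, p_value).
    """
    expected_hits = total * expected_rate
    expected_misses = total - expected_hits
    observed_misses = total - observed_hits
    chi2 = ((observed_hits - expected_hits) ** 2 / expected_hits
            + (observed_misses - expected_misses) ** 2 / expected_misses)
    # For 1 degree of freedom, the upper tail is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p = chi_square_one_proportion(observed_hits=20, total=100,
                                    expected_rate=0.05)
print(f"chi2 = {chi2:.2f}, P < 0.01: {p < 0.01}")
```

With real data, the observed counts would come from the duplicate annotations summarized in Table II and the expected rate from the dataset-wide disease fractions in Table I.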
The large proportion of disease proteins with duplicates associated with disease suggests that gene duplication is a significant phenomenon contributing to the expansion of disease-prone protein families. These families may contain proteins associated with different diseases. A clustering of OMIM entries based on sequence similarity of their associated proteins is shown in Figure 4. Examining such clusters allows the identification of disease families, which may yield new insight into how diseases may be related.
By comparing disease and non-disease proteins, we have shown that disease proteins tend to have folds with fewer families than non-disease proteins, suggesting that they are intrinsically more structurally vulnerable to mutation or environmental stresses. Disease proteins also tend to be longer, to have isoelectric points closer to neutrality and to contain more aggregation-prone stretches than non-disease proteins, suggesting that the former are more likely to aggregate than the latter upon unfolding or misfolding. Many disease proteins are duplicates of other disease proteins, reinforcing the notion that sequence similarity to known disease proteins can contribute substantially to disease propensity. These results were apparent even when we defined our disease protein set to include only those proteins with known disease-causing amino acid substitutions and when we chose to use a high-quality disease protein set (Supplementary material S3, S5–7).
The reader should be aware that our results are based on incomplete data. We have defined the domains in human proteins using SCOP, which is biased towards domains which are soluble and amenable to structure determination. However, disease proteins tend to be relatively more conserved than non-disease proteins (López-Bigas and Ouzounis, 2004). Therefore, even without consideration of which domains are defined by SCOP, one finds that disease proteins are more sequence restricted, consistent with the hypothesis that they have less designable structures (if any) than non-disease proteins.
Our results also depend on the accuracy of the human gene models in the human genomes used, the proteins predicted to be expressed using these gene models and the annotation associating proteins with various diseases. Nevertheless, our results hold across two different human databases with different levels of disease annotation. We have also verified them against a high-quality subset of disease–gene relations.
The probability that a protein becomes associated with disease depends on multiple factors, including the mutation type, the protein involved, the rest of the organism and the environment to which the organism is exposed. Our results are global trends gleaned from data currently stored in databases. The trends do not necessarily apply to specific populations or individuals. It would be of great interest to integrate information from epidemiological studies with the trends derived here to shed light on this matter.
The finding that families of sequence-similar disease proteins exist suggests common origins and mechanisms for many of our modern diseases. Drugs targeting one member of a particular family may affect others in that family. Understanding the properties which predispose proteins to disease and how they may have evolved will perhaps aid the identification of novel gene-to-disease relations and the treatment of the associated diseases.
Acknowledgements |
---|
References |
---|
Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389–3402.
Andreeva,A., Howorth,D., Brenner,S.E., Hubbard,T.J., Chothia,C. and Murzin,A.G. (2004) Nucleic Acids Res., 32, D226–D229.
Bailey,J.A., Gu,Z., Clark,R.A., Reinert,K., Samonte,R.V., Schwartz,S., Adams,M.D., Myers,E.W., Li,P.W. and Eichler,E.E. (2002) Science, 297, 1003–1007.
Bingham,J. and Sudarsanam,S. (2000) Bioinformatics, 16, 660–661.
Birney,E. et al. (2004) Genome Res., 14, 925–928.
Bortoluzzi,S., Romualdi,C., Bisognin,A. and Danieli,G.A. (2003) Physiol. Genomics, 15, 223–227.
Botstein,D. and Risch,N. (2003) Nat. Genet., 33(Suppl.), 228–237.
Bucciantini,M., Giannoni,E., Chiti,F., Baroni,F., Formigli,L., Zurdo,J., Taddei,N., Ramponi,G., Dobson,C.M. and Stefani,M. (2002) Nature, 416, 507–511.
Bucciantini,M., Calloni,G., Chiti,F., Formigli,L., Nosi,D., Dobson,C.M. and Stefani,M. (2004) J. Biol. Chem., 279, 31374–31382.
Carlson,C.S., Eberle,M.A., Kruglyak,L. and Nickerson,D.A. (2004) Nature, 429, 446–452.
Chenna,R., Sugawara,H., Koike,T., Lopez,R., Gibson,T.J., Higgins,D.G. and Thompson,J.D. (2003) Nucleic Acids Res., 31, 3497–3500.
Chiti,F., Stefani,M., Taddei,N., Ramponi,G. and Dobson,C.M. (2003) Nature, 424, 805–808.
Dima,R.I. and Thirumalai,D. (2004) Bioinformatics, 20, 2345–2354.
Dobson,C.M. (2004) Semin. Cell Dev. Biol., 15, 3–16.
DuBay,K.F., Pawar,A.P., Chiti,F., Zurdo,J., Dobson,C.M. and Vendruscolo,M. (2004) J. Mol. Biol., 341, 1317–1326.
England,J.L., Shakhnovich,B.E. and Shakhnovich,E.I. (2003) Proc. Natl Acad. Sci. USA, 100, 8727–8731.
Ferrer-Costa,C., Orozco,M. and de la Cruz,X. (2002) J. Mol. Biol., 315, 771–786.
Hammond,M.P. and Birney,E.R (2004) Trends Genet., 20, 268–272.
Hamosh,A., Scott,A.F., Amberger,J.S., Bocchini,C.A. and McKusick,V.A. (2005) Nucleic Acids Res., 33, D514–D517.
Huang,H. et al. (2004) Genome Biol., 5, R47.
Li,H., Helling,R., Tang,C. and Wingreen,N. (1996) Science, 273, 666–669.
López-Bigas,N. and Ouzounis,C.A. (2004) Nucleic Acids Res., 32, 3108–3114.
Ramensky,V., Bork,P. and Sunyaev,S. (2002) Nucleic Acids Res., 30, 3894–3900.
Reumers,J., Schymkowitz,J., Ferkinghoff-Borg,J., Stricher,F., Serrano,L. and Rousseau,F. (2005) Nucleic Acids Res., 33, D527–D532.
Riley,M.L., Schmidt,T., Wagner,C., Mewes,H.W. and Frishman,D. (2005) Nucleic Acids Res., 33, Database Issue, D308–D310.
Ross,C.A. and Poirier,M.A. (2004) Nat. Med., 10, S10–S17.
Scully,J.L. (2004) EMBO Rep., 5, 650–653.
Shaw,C.J. and Lupski,J.R. (2004) Hum. Mol. Genet., 13, R57–R64.
Smith,N.G. and Eyre-Walker,A. (2003) Gene, 318, 169–175.
Stenson,P.D., Ball,E.V., Mort,M., Phillips,A.D., Shiel,J.A., Thomas,N.S., Abeysinghe,S., Krawczak,M. and Cooper,D.N. (2003) Hum. Mutat., 21, 577–581.
Steward,R.E., MacArthur,M.W., Laskowski,R.A. and Thornton,J.M. (2003) Trends Genet., 19, 505–513.
Terp,B.N., Cooper,D.N., Christensen,I.T., Jorgensen,F.S., Bross,P., Gregersen,N. and Krawczak,M. (2002) Hum. Mutat., 20, 98–109.
Wang,Z. and Moult,J. (2001) Hum. Mutat., 17, 263–270.
Wingreen,N., Li,H. and Tang,C. (2004) Polymer, 45, 699–705.
Winter,E.E., Goodstadt,L. and Ponting,C.P. (2004) Genome Res., 14, 54–61.
Yu,H., Luscombe,N.M., Lu,H.X., Zhu,X., Xia,Y., Han,J.D., Bertin,N., Chung,S., Vidal,M. and Gerstein,M. (2004) Genome Res., 14, 1107–1118.
Zhang,C.T. (1997) Protein Eng., 10, 757–761.
Received February 15, 2005; revised May 25, 2005; accepted August 3, 2005.
Edited by Luis Serrano