European Media Laboratory, Heidelberg, Germany
Abstract |
---|
Introduction |
---|
Barabási and Albert (1999) introduced a theoretical model that generates graphs whose connectivity distribution decays as a power law. This feature was found to be a direct consequence of two generic mechanisms: (1) networks expand continuously by the addition of new vertices, and (2) these newly added vertices attach preferentially to sites that are already well connected (Barabási and Albert 1999). Since this feature is independent of the actual size of the network, this class of inhomogeneous networks was called scale-free networks. The topology of the World Wide Web was investigated by considering HTML documents as vertices connected by links pointing from one page to another (Albert, Jeong, and Barabási 1999; Barabási and Albert 1999; Barabási, Albert, and Jeong 2000). The Web, as well as the Internet, which emerges from connecting different servers, demonstrates scale-free properties. Both nets display a high degree of robustness against errors (Albert, Jeong, and Barabási 2000). However, these networks are highly vulnerable to perturbations of the highly connected nodes.
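The two mechanisms named above, growth and preferential attachment, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fully connected starting core, and the parameters `n_vertices` and `m_links` are assumptions chosen for illustration.

```python
import random
from collections import Counter


def grow_preferential_attachment(n_vertices, m_links, seed=0):
    """Grow a graph in which each new vertex links to m_links existing
    vertices chosen with probability proportional to their degree.

    Returns a set of undirected edges (u, v) with u < v.
    """
    rng = random.Random(seed)
    edges = set()
    # Each vertex is listed in `targets` once per incident edge, so a uniform
    # draw from this list picks a vertex proportionally to its degree.
    targets = []
    # Assumption: start from a fully connected core of m_links + 1 vertices.
    for u in range(m_links + 1):
        for v in range(u + 1, m_links + 1):
            edges.add((u, v))
            targets.extend((u, v))
    # Growth: add one vertex at a time.
    for new in range(m_links + 1, n_vertices):
        chosen = set()
        while len(chosen) < m_links:
            chosen.add(rng.choice(targets))
        for old in chosen:
            edges.add((old, new))
            targets.extend((old, new))
    return edges


def degree_counts(edges):
    """Return a Counter mapping each vertex to its number of links."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


edges = grow_preferential_attachment(n_vertices=2000, m_links=2)
deg = degree_counts(edges)
mean_degree = sum(deg.values()) / len(deg)
print(mean_degree, max(deg.values()))
```

In such a run, a few early vertices typically end up with far more links than the average vertex, which is the inhomogeneity the model is meant to capture.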
Biological Networks
Recently, scale-free and small-world behaviors have also been found in biological networks. Watts and Strogatz (1998) reported the architecture of the Caenorhabditis elegans nervous system to show significant small-world behavior. Fell and Wagner (2000) assembled a list of stoichiometric equations representing the central routes of energy metabolism and small-molecule building block synthesis in Escherichia coli. A substrate graph, defined by a vertex set consisting of all metabolites that occur in the network, was constructed. Two metabolites were considered to be linked if they occurred in the same reaction. Fell and Wagner (2000) found the substrate graph to be sparse, with glutamate, coenzyme A, 2-oxoglutarate, pyruvate, and glutamine having the highest degree of connectivity. This sample of metabolites might be viewed as a core of E. coli metabolism which was found without any subjective criteria.
Most recently, Jeong et al. (2000) comparatively analyzed metabolic networks of organisms representing all three domains of life. The metabolic network is represented by nodes, the substrates, connected by directed edges symbolizing the actual reaction. The topologies of these networks are best described by a scale-free model. Furthermore, the diameters of the nets remain the same for all of these networks regardless of the number of substrates found in the given species. Interestingly, the ranking of the most connected substrates is largely identical for all organisms, thus indicating hubs which dominate the topology of the nets. Like the technical networks, the E. coli network theoretically has high tolerance to random errors but severe sensitivity to the removal of the highly connected nodes.
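The contrast between tolerance to random errors and sensitivity to hub removal can be made concrete with a small Python sketch. The hub-and-spoke graph and the function name below are hypothetical, and the size of the largest connected component is used here as one possible measure of damage; the cited studies may quantify it differently.

```python
def largest_component_size(adj, removed=()):
    """Size of the largest connected component after deleting `removed` vertices.

    `adj` maps each vertex to the set of its neighbors (undirected graph).
    """
    removed = set(removed)
    seen = set()
    best = 0
    for start in adj:
        if start in removed or start in seen:
            continue
        # Depth-first traversal of the component containing `start`.
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for w in adj[u]:
                if w not in removed and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


# Hypothetical toy network: one hub linked to four otherwise unlinked vertices.
star = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub"},
}
after_error = largest_component_size(star, removed=["a"])     # a peripheral vertex fails
after_attack = largest_component_size(star, removed=["hub"])  # the hub is removed
```

Removing a peripheral vertex leaves the remaining vertices connected, whereas removing the hub splits the toy network into isolated vertices.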
Another biochemical network is formed by sets of domains which are linearly arranged in protein sequences. This might generate graphs comprising interesting features. Since the topology of graphs thus generated is still unknown, it is worth considering this way of treating domain architectures.
Domain Organization
Protein crystallography reveals that the fundamental unit of protein structure is the domain. Independent of neighboring sequences, this region of a polypeptide chain folds into a distinct structure and mediates biological functionality (Janin and Chothia 1985). Most proteins contain only a single domain (Doolittle 1995). Some sequences appear as multidomain proteins adopting different linear arrangements of their domain sets. On average, such domain architectures comprise two to three domains; however, some human proteins contain up to 130 domains (Li et al. 2001).
Similar to the discussion about the role of certain metabolites in the emergence of metabolism, there has been a debate about the actual number of existing domains and their origin. One view treats all past and present proteins as the result of shuffling of a large set of primordial polypeptides (Dorit and Gilbert 1991). These are assumed to result from splicing events involving exons separated by introns (Gilbert and Glynias 1993). The other view deals with the existence of a few small polypeptides in the early stages of life; these are the predecessors of most contemporary proteins (Doolittle 1995). Gene duplication and subsequent modification were employed to form the latter molecules from this small set of polypeptides. Independent of the timing for the introduction of introns, recombination in introns provides a mechanism for the exchange of exons between genes. This mechanism for the acquisition of new functions by eukaryotic genes is commonly known as "exon shuffling." It was assumed that primitive proteins were encoded by exons that were spliced together (Seidel, Pompliano, and Knowles 1992). However, such shuffling events take on biological significance only if the exons involved carry a functional or structural domain. Although many examples of exon shuffling have been found, no significant correspondence between exons and units of protein structure has been detected (Stoltzfus et al. 1994).
It is common to find that newly sequenced proteins are homologous to some other known proteins over parts of their lengths. Thus, most proteins may have descended from relatively few ancestral types. The sequences of large proteins often show signs of having evolved by the joining of preexisting domains in new combinations. Such a mechanism is commonly known as "domain shuffling" and appears as two types: domain duplication and domain insertion (Doolittle 1995). Domain duplication refers to the internal duplication of at least one domain in a gene. Domain insertion denotes the process by which structural or functional domains are exchanged between proteins or inserted into a protein. Shuffling of domains has more biological significance than exon shuffling because domains are real structural and functional units in proteins, while exons often are not.
Functional links between proteins have also been detected by analyzing the fusion patterns of protein domains. Two separate proteins A and B in one organism may be expressed as a fusion protein in other species. A protein sequence containing both A and B is termed a Rosetta Stone sequence. However, this framework applies only in a minority of cases (Marcotte et al. 1999).
Protein Domain Databases
Currently, there is a large variety of databases, each collecting protein domain information in a completely different way. The Prosite database (http://expasy.proteome.org.au/prosite/) consists of biologically significant motifs and profiles determined and formulated with appropriate computational tools. Uncharacterized proteins are assigned to certain protein families with the aid of weight matrices and profiles (Hofmann et al. 1999). The majority of Prosite documentation refers to motifs, thus providing combined motif and domain information. Release 16.0 of Prosite contains 1,374 different patterns, rules, and profiles.
Another database is Pfam (http://www.sanger.ac.uk/Software/Pfam/index.shtml), which is a large collection of multiple-sequence alignments of protein families and profile hidden Markov models (Bateman et al. 2000). Moreover, Pfam contains curated documentation for all 2,478 families in version 5.5, covering nearly 65% of SwissProt release 38 and SP-TrEMBL release 11.
Many more protein families are found, however, in ProDom (http://www.toulouse.inra.fr/prodom.html) (Corpet et al. 2000), which contains all protein domain families that can be generated automatically from the SwissProt and TrEMBL sequence databases (Bairoch and Apweiler 2000). Expert-validated families are extended by using Pfam seed alignments to build new ProDom families with the Psi-Blast database searching algorithm (Altschul et al. 1997). Other families are generated by recursive use of Psi-Blast. ProDom, version 99.2, has 157,648 domain families, covering almost 95% of SwissProt release 37 and TrEMBL release 10. ProDom offers higher coverage than Pfam. However, ProDom tends to overpredict the number of protein families which can be discovered as subsets of larger families.
Finally, InterPro (http://www.ebi.ac.uk/interpro) (Apweiler et al. 2001a) is an integrated documentation resource of protein families, domains, and functional sites rationalizing the complementary efforts of the Prosite, Pfam, ProDom, and Prints (Attwood et al. 2000) database projects. InterPro contains manually curated documentation and diagnostic signatures from these databases and uses these to create a unique, nonredundant characterization of protein families, domains, and functional sites.
Proteome Databases
The advent of fully sequenced genomes of various organisms has facilitated the investigation of proteomes. The Proteome Analysis database (http://www.ebi.ac.uk/proteome) (Apweiler et al. 2001b) has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms. The analysis is compiled using mainly InterPro and CluSTr (Kriventseva et al. 2001) and is performed on the nonredundant complete proteome sets of SwissProt and TrEMBL entries. The latest release provides 41 nonredundant proteomes of genomes of archaea, bacteria, and eukaryotes.
Most recently, SwissProt and Ensembl have prepared a complete nonredundant human proteome set consisting of 30,585 sequences. The set consists of the combination of the SwissProt/TrEMBL nonredundant human proteome set (15,691 sequences) and additional nonredundant peptides predicted by Ensembl (14,894 sequences). Ensembl (http://www.ensembl.org) provides complete and consistent annotation across the human genome.
In this paper, domain networks generated with data from ProDom, Pfam, and Prosite domain databases will be presented. Furthermore, InterPro domain networks of different species that are generated with complete proteome sets provided by the Proteome Analysis database will be considered. Subsequently, the topology of these networks will be investigated, and biological and evolutionary consequences will be discussed.
Materials and Methods |
---|
Growing amounts of empirical and theoretical data about the topologies of large complex networks indicate the emergence of several network types. Basically, these types are classified by the connectivity distribution $P(k)$ of nodes. Exponential networks are characterized by a $P(k)$ that peaks at an average $\langle k \rangle$ and decays exponentially. Prominent protagonists of this type are the random graph model (Erdös and Rényi 1960) and the small-world model (Watts and Strogatz 1998). Both lead to fairly homogeneous networks with nodes comprising approximately the same number of links, $k \approx \langle k \rangle$ (Barabási, Albert, and Jeong 1999). Furthermore, a small-world graph adopts a sparse topology, $L \gtrsim L_{\mathrm{random}}$, but remains more highly clustered than an equally sparse random graph, $C \gg C_{\mathrm{random}}$ (Watts and Strogatz 1998). By contrast, in the class of inhomogeneous networks called scale-free networks, the connectivity distribution decays as a power law, $P(k) \sim k^{-\gamma}$. The latter result indicates a network free of a characteristic scale. Compared with exponential networks, the probability that a node is highly connected ($k \gg \langle k \rangle$) is statistically significant in scale-free networks (Barabási and Albert 1999).
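The quantities L and C in the small-world comparison are not defined in the text. Following Watts and Strogatz (1998), they are commonly taken to be the characteristic path length (the mean shortest-path distance between vertex pairs) and the clustering coefficient (the mean fraction of a vertex's neighbors that are themselves linked). The Python sketch below computes both under that assumption; the function names and the toy graph are hypothetical, and vertices with fewer than two neighbors are counted with a clustering value of zero, which is one of several conventions.

```python
from collections import deque


def clustering_coefficient(adj):
    """Mean over all vertices of (links among neighbors) / (possible links)."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # convention assumed here: such vertices contribute 0
        nb = list(nbrs)
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nb[j] in adj[nb[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def characteristic_path_length(adj):
    """Mean shortest-path length over all connected ordered vertex pairs."""
    total = 0
    pairs = 0
    for source in adj:
        # Breadth-first search gives shortest distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Hypothetical toy graph: a triangle (0, 1, 2) with one pendant vertex (3).
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = clustering_coefficient(toy)
L = characteristic_path_length(toy)
```

For a real network, the same two numbers would be compared with those of a random graph of equal size and link density.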
In this study, protein domain information was retrieved from the ProDom, Prosite, and Pfam databases. Sixty-five percent of all ProDom sequences correspond to families containing 10 or more members. In order to restrict the size of the network, the sample of ProDom domains was limited to these families. Thus, 5,995 ProDom domains were obtained. The Prosite database flags false-negative entries, which were filtered out of the sample used for the network construction. Sequence entries of each database provide SwissProt annotation. Thus, every protein sequence was itemized with each domain that it contained. This was done for each database separately. Domains that occur together in one protein sequence represent vertices which are connected to each other in the domain graphs.
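The construction rule for the domain graph can be sketched in Python as follows. This is an illustration, not the original pipeline: the function name, the input format (a mapping from protein identifier to its list of domains), and the toy annotation are assumptions, as is the choice to ignore repeated copies of a domain within one protein so that a domain is never linked to itself.

```python
from collections import defaultdict
from itertools import combinations


def build_domain_graph(protein_domains):
    """Link every pair of distinct domains that co-occur in at least one protein.

    Returns a dict mapping each domain to the set of its neighboring domains.
    """
    neighbors = defaultdict(set)
    for domains in protein_domains.values():
        unique = set(domains)  # assumption: repeats of a domain are collapsed
        for d in unique:
            neighbors[d]  # register the vertex even if it stays unlinked
        for a, b in combinations(sorted(unique), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)


# Hypothetical toy annotation; the protein identifiers are made up.
toy_annotation = {
    "P1": ["pkinase", "SH2", "SH3"],
    "P2": ["pkinase", "SH3"],
    "P3": ["ank", "pkinase"],
    "P4": ["helicase_C"],
    "P5": ["EGF", "EGF", "EGF"],
}
graph = build_domain_graph(toy_annotation)
connectivity = {domain: len(nbrs) for domain, nbrs in graph.items()}
```

In this toy example pkinase is linked to three other domains, while helicase_C and EGF remain unlinked because they never co-occur with a different domain.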
Complete proteome data sets of different species were retrieved from the Proteome Analysis database, which uses InterPro annotation of protein domains. Such proteome data sets adopt SwissProt, TrEMBL, TrEMBLnew, and Ensembl annotation of proteins. Analogously, InterPro domains which appear along with other domains in a protein sequence represent vertices which are connected to each other in the domain graphs. The numbers of links to other domains in such graphs were logarithmically binned, and frequencies were thus obtained. Such pairs of values were subjected to a linear regression procedure.
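The binning and regression step can be sketched in Python as follows. The text does not specify the bin boundaries or normalization, so the choices here are assumptions: bins double in width ([1, 2), [2, 4), [4, 8), ...), counts are divided by bin width and sample size, unlinked domains (k = 0) are left out because their logarithm is undefined, and an ordinary least-squares line is fitted to the log-log values. All function names are hypothetical.

```python
import math
from collections import Counter


def log_binned_frequencies(degrees):
    """Bin integer degrees k >= 1 into [1, 2), [2, 4), [4, 8), ... and return
    (geometric bin center, frequency per unit k) pairs for non-empty bins."""
    counts = Counter(k.bit_length() - 1 for k in degrees if k >= 1)
    total = sum(counts.values())
    points = []
    for i in sorted(counts):
        width = 2 ** i                      # number of integer degrees in bin i
        center = (2 ** i) * math.sqrt(2.0)  # geometric mean of the bin edges
        points.append((center, counts[i] / (width * total)))
    return points


def fit_line_loglog(points):
    """Least-squares fit of log10(frequency) against log10(k).

    Returns (slope, intercept); for a power law the slope estimates -gamma.
    """
    xs = [math.log10(k) for k, _ in points]
    ys = [math.log10(f) for _, f in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic degrees constructed so that the binned frequency falls off as k**-2.
toy_degrees = [1] * 64 + [2] * 32 + [4] * 16 + [8] * 8 + [16] * 4
points = log_binned_frequencies(toy_degrees)
slope, intercept = fit_line_loglog(points)
```

Because the synthetic sample is built to follow an inverse-square law, the fitted slope comes out close to -2; real domain graphs would give noisier values.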
PAJEK (the Slovene word for spider), a program for large-network analysis and visualization, was used for the calculation of the latter values (Batagelj and Mrvar 1998). This program is available at http://vlado.fmf.uni-lj.si/pub/networks/pajek/.
Results |
---|
Discussion |
---|
Completeness and Quality of Data
Regardless of whether Pfam, Prosite, or ProDom domain information is used, the qualitative topology of domain networks remains unchanged. Since these databases differ significantly in size and methodology, it is tempting to argue that even though the current domain data are far from complete, the topology of domain networks will not change significantly with the growing amount of domain data. This assumption is supported by the characteristics of scale-free networks, which lead to domain graphs that are independent of the actual size of the underlying networks. Hence, the major observation that the topology of domain graphs is dominated by a few highly linked domains will not be changed entirely by the incorporation of new protein domain data. InterPro gathers and streamlines mostly distinct domain information from the above-mentioned domain databases, providing a centralized annotation resource to reduce the amount of duplication between the database resources. Hence, scale-free characteristics of InterPro domain networks which were generated with the aid of complete proteomes of different species do not change significantly in comparison to networks generated with domain information from a single database. However, it should be noted that the acquisition of protein domain information is biased to a certain extent, since eukaryotic and mammalian proteins are on average far better studied and documented in databases than archaeal or prokaryotic proteins.
Another important consideration regards aspects of acquisition of proteome information. Proteome data which were entirely extracted by genome translation might not sufficiently explain the setup of all cellular processes. Domain networks were generated with the aid of translated genome databases which did not cover effects that include alternative splicing and domain usage. Alternative pre-mRNA splicing is an important mechanism for regulating gene expression in higher eukaryotes (Smith, Patton, and Nadal-Ginard 1989). By recent estimates, the primary transcripts of 30% of human genes are subject to alternative splicing. Thus, the connectivity of domains found in higher eukaryotes might be significantly higher than it is "in silico."
In addition, the differences in frequency distributions between higher eukaryotes, bacteria, and archaea in figures 4 and 5 might also be related to the numbers of domain architectures that were found in the different organisms. Since eukaryotes and mammals developed much more distinct domain architectures (International Human Genome Sequencing Consortium 2001), the respective distributions of domain connections are statistically more reliable than those of prokaryotes and archaea. Therefore, future studies should clarify whether the small number of domain architectures leads to slight artifacts in the slope of prokaryotic and archaeal organisms.
Evolutionary Aspects
Are the observed topologies a direct consequence of domain evolution? The model of Barabási and Albert (1999) generates scale-free networks by preferential attachment of newly added vertices to already well connected ones. Consequently, Fell and Wagner (2000) argued that vertices with many connections in a metabolic network were metabolites originating very early in the course of evolution and shaping a core metabolism. Analogously, highly connected domains could also have originated very early. A comparison of the lists of the most highly linked domains in table 3 shows that this assumption does not hold. The majority of the more highly linked domains in Methanococcus and E. coli are mainly concerned with the maintenance of metabolism. Nearly none of the highly linked domains of the higher organisms can be found in Methanococcus and E. coli, and vice versa; in the higher organisms, the focus of domain connection shifts to domain hubs involved in signal transduction, transcription, and cell-cell interactions. In addition, helicase C has roughly similar degrees of connection in all organisms. However, the ankyrin repeat motif (ank) is one of the few domains which can be found to be unlinked in E. coli, whereas it possesses a growing degree of connectivity in higher eukaryotes.
Apparently, the majority of highly connected domains seem to have arisen late in eukaryotes of larger proteome size. The evolutionary trend toward multicellularity requires proteomes which feature new and additional complex cellular processes like signal transduction or cell-cell contacts. One way of accomplishing growing demands is the expansion of already-existing protein sets. Indeed, many protein families are expanded in humans relative to Drosophila and C. elegans. These are mainly involved in inter- and intracellular signaling pathways, apoptosis (Aravind, Dixit, and Koonin 2001), development, and immune and neural functions (International Human Genome Sequencing Consortium 2001; Venter et al. 2001). Although many protein families of these organisms exhibit great disparities in abundance, C2H2-type zinc finger motifs and eukaryotic protein kinase (pkinase) are among the top 10 most frequent domain families (Rubin et al. 2000; Tupler, Perini, and Green 2001) and the best-connected domains in table 3. At least in higher eukaryotes, both domains tend to increase their connections to other domains in a way similar to that of the already-mentioned ankyrin repeat motif (ank).
Although human phenotypic complexity by far exceeds that of Drosophila and C. elegans, proteome dimensions remain comparatively small. Thus, combinatorial aspects of domain arrangements might have a major impact on the preservation of cellular processes. Among chromatin-associated proteins, transcription factors, and especially apoptosis proteins, a significant portion of protein architecture is shared between humans and Drosophila. However, substantial innovation in the creation of new protein architectures was also detected (International Human Genome Sequencing Consortium 2001). Apparently, the expansion of particular domain families and the accompanying evolution of complex domain architectures from presumably preexisting domains coincide with the increase in the organism's complexity. In this regard, the different slopes in figures 4 and 5 indicate this evolutionary trend toward higher connectivity of domains (e.g., pkinase, SH3, and EGF in table 3), as well as a growing complexity in the arrangement of domains within proteins. In comparison to noneukaryotes, Drosophila developed more complex domain architectures. Thus, the frequency distributions of the latter organisms can be clearly separated in figure 5, where lower complexity in domain architecture is indicated by steeper slopes. The first point is well reflected by the slightly different slopes of humans, Drosophila, and C. elegans in figure 4.
In conclusion, a variety of arguments point to an increase in the complexity of the proteome from the single-celled yeast to multicellular vertebrates such as humans. Essentially, the expansion of protein families coincides with the increase of connectivity of the respective domains. Extensive shuffling of domains to increase combinatorial diversity might provide protein sets which are sufficient to preserve cellular procedures without dramatically expanding the absolute size of the protein complement. Hence, the relatively greater proteome complexity of higher eukaryotes, and especially humans, cannot be simply a consequence of genome size but, to a certain extent, must also be a consequence of innovations in domain arrangements. Thus, highly linked domains represent functional centers in various different cellular aspects. They could be treated as evolutionary hubs which help to organize the domain space by occasionally linking them to numerous other functionally related domains.
Quality of the Basic Models
The view that new protein architectures can be created by shuffling, adding, and deleting domains, resulting in new proteins from old parts, is well reflected by the emergence of such domain hubs. However, there exist a variety of domain arrangements which contradict the ideal image of continuous addition of new domain links to already-existing hubs in the sense of scale-free networks. The S1 RNA-binding domain is linked to helicase C in E. coli, while it is found to be connected to RNB, KH domain, and RNAse PH in humans. Neither the procedure of generating a small-world graph in the original sense nor the scale-free model provides for the deletion of vertices. However, the assumption that domains emerge and disappear occasionally is a basic requirement of protein evolution. Thus, scale-free and small-world models can only be a rough approximation to the real situation.
Acknowledgements |
---|
Footnotes |
---|
1 Keywords: protein domains, scale-free and small-world topology, evolution of protein architectures
2 Address for correspondence and reprints: Stefan Wuchty, European Media Laboratory, Schloß-Wolfsbrunnenweg 33, D-69118 Heidelberg, Germany. stefan.wuchty{at}eml.villa-bosch.de
References |
---|
Albert R., H. Jeong, A. Barabási, 1999 Diameter of the World Wide Web Nature 401:130-131
Albert R., H. Jeong, A. Barabási, 2000 Error and attack tolerance of complex networks Nature 406:378-382
Altschul S., T. Madden, A. Schaeffer, J. Zhang, Z. Zhang, W. Miller, D. Lipman, 1997 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Nucleic Acids Res 25:3389-3402
Apweiler R., T. Attwood, A. Bairoch, et al. (26 co-authors) 2001a The InterPro database, an integrated documentation resource for protein families, domains and functional sites Nucleic Acids Res 29:37-40
Apweiler R., M. Biswas, W. Fleischmann, et al. (11 co-authors) 2001b Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes Nucleic Acids Res 29:44-48
Aravind L., V. Dixit, E. Koonin, 2001 Apoptotic molecular machinery: vastly increased complexity in vertebrates revealed by genome comparisons Science 291:1279-1284
Attwood T., M. Croning, D. Flower, A. Lewis, J. Mabey, P. Scordis, J. Selley, W. Wright, 2000 PRINTS-S: the database formerly known as PRINTS Nucleic Acids Res 28:225-227
Bairoch A., R. Apweiler, 2000 The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Nucleic Acids Res 28:45-48
Barabási A., R. Albert, 1999 Emergence of scaling in random networks Science 286:509-512
Barabási A., R. Albert, H. Jeong, 1999 Mean-field theory for scale-free random networks Physica A 272:173-187
Barabási A., R. Albert, H. Jeong, 2000 Scale-free characteristics of random networks: the topology of the World-Wide Web Physica A 281:69-77
Barthélémy M., L. Amaral, 1999 Small-world networks: evidence for a crossover picture Phys. Rev. Lett 82:3180-3183
Batagelj V., A. Mrvar, 1998 PAJEK: program for large network analysis Connections 21:47-57
Bateman A., E. Birney, R. Durbin, S. Eddy, K. Howe, E. Sonnhammer, 2000 The Pfam protein families database Nucleic Acids Res 28:263-266
Bornberg-Bauer E., 1997 How are model protein structures distributed in sequence space? Biophys. J 73:2393-2403
Corpet F., F. Servant, J. Gouzy, D. Kahn, 2000 ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons Nucleic Acids Res 28:267-269
Doolittle R., 1995 The multiplicity of domains in proteins Annu. Rev. Biochem 64:287-314
Dorit R., W. Gilbert, 1991 The limited universe of exons Curr. Opin. Genet. Dev 1:464-469
Erdös P., A. Rényi, 1960 On the evolution of random graphs Publ. Math. Inst. Hung. Acad. Sci 5:17-61
Fell D., A. Wagner, 2000 The small world of metabolism Nat. Biotech 18:1121-1122
Gilbert W., M. Glynias, 1993 On the ancient nature of introns Gene 135:137-144
Guare J., 1990 Six degrees of separation: a play Vintage Books, New York
Hofmann K., P. Bucher, L. Falquet, A. Bairoch, 1999 The PROSITE database, its status in 1999 Nucleic Acids Res 27:215-219
Huberman B., P. Pirolli, J. Pitkow, R. Lukose, 1998 Strong regularities in World Wide Web surfing Science 280:95-97
International Human Genome Sequencing Consortium. 2001 Initial sequencing and analysis of the human genome Nature 409:860-921
Janin J., C. Chothia, 1985 Domains in proteins: definitions, location, and structural principles Methods Enzymol 115:420-430
Jeong H., B. Tombor, R. Albert, Z. Oltvai, A.-L. Barabási, 2000 The large-scale organization of metabolic networks Nature 407:651-654
Kriventseva E., W. Fleischmann, E. Zdobnov, R. Apweiler, 2001 CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins Nucleic Acids Res 29:33-36
Li W.-H., Z. Gu, H. Wang, A. Nekrutenko, 2001 Evolutionary analyses of the human genome Nature 409:847-849
Marcotte E., M. Pellegrini, H.-L. Ng, D. Rice, T. Yeates, D. Eisenberg, 1999 Detecting protein function and protein-protein interactions from genome sequences Science 285:751-753
Milgram S., 1967 The small-world problem Psychol. Today 2:60-67
Miller G., E. Newman, 1958 Tests of a statistical explanation of the rank-frequency relation for words in written English Am. J. Psychol 71:209-218
Rubin G., M. Yandell, J. Wortmann, et al. (52 co-authors) 2000 Comparative genomics of the eukaryotes Science 287:2204-2215
Schuster P., W. Fontana, P. Stadler, I. Hofacker, 1994 From sequences to shapes and back: a case study in RNA secondary structures Proc. R. Soc. Lond. B Biol. Sci 255:279-284
Seidel H., D. Pompliano, J. Knowles, 1992 Exons as microgenes Science 257:1489-1490
Smith C., J. Patton, B. Nadal-Ginard, 1989 Alternative splicing in the control of gene expression Annu. Rev. Genet 23:527-577
Stoltzfus A., D. Spencer, M. Zuker, J. Logsdon Jr., W. Doolittle, 1994 Testing the exon theory of genes: the evidence from protein structure Science 265:202-207
Tupler R., G. Perini, M. Green, 2001 Expressing the human genome Nature 409:832-833
Venter J., M. Adams, E. Myers, et al. (271 co-authors) 2001 The sequence of the human genome Science 291:1304-1351
Watts D., S. Strogatz, 1998 Collective dynamics of small-world networks Nature 393:440-442