Scale-Free Behavior in Protein Domain Networks

Stefan Wuchty

European Media Laboratory, Heidelberg, Germany


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
Several technical, social, and biological networks were recently found to demonstrate scale-free and small-world behavior instead of random graph characteristics. In this work, the topology of protein domain networks generated with data from the ProDom, Pfam, and Prosite domain databases was studied. It was found that these networks exhibited small-world and scale-free topologies with a high degree of local clustering accompanied by a few long-distance connections. Moreover, these observations apply not only to the complete databases, but also to the domain distributions in proteomes of different organisms. The extent of connectivity among domains reflects the evolutionary complexity of the organisms considered.


    Introduction
Many diverse systems may best be described as networks with complex topologies. Often, the connection topology is assumed to be either completely regular or completely random (Erdös and Rényi 1960). Watts and Strogatz (1998) revealed a new class of network topologies that lies between these two extremes. Originally, these small-world networks were generated by randomly rewiring edges in a regular network. Small-world networks combine the local clustering of connections characteristic of regular networks with occasional long-range connections between clusters, as can be expected to occur in random networks. By defining measures that distinguish these three types of networks, Watts and Strogatz (1998) showed that several biological, technological, and social networks are of the small-world type. A small-world graph is formally defined as a sparse graph which is much more highly clustered than an equally sparse random graph. Barthélémy and Amaral (1999) provided evidence that the appearance of small-world behavior is not a phase transition, but a crossover phenomenon which depends on both the network size and the degree of disorder. Small-world graphs were first illustrated with friendship networks (Milgram 1967) in sociology, often referred to as "six degrees of separation" (Guare 1990). The architecture of the power grid of the western United States, the structures of some sociological networks dealing with mathematical collaborations on publications, and the casting of actors in movies were found to be small-world graphs (Watts and Strogatz 1998).
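The rewiring construction attributed to Watts and Strogatz can be illustrated with a short sketch. It is not the authors' code: the function name `rewired_ring` and the parameters `n`, `k`, `p`, and `seed` are assumptions chosen for illustration, and the starting regular network is taken to be a ring lattice.

```python
import random

def rewired_ring(n, k, p, seed=0):
    """Ring lattice of n nodes, each joined to its k nearest neighbours
    (k even); every lattice edge is rewired with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    # Regular ring lattice: k/2 neighbours on each side of every node.
    for v in range(n):
        for step in range(1, k // 2 + 1):
            w = (v + step) % n
            adj[v].add(w)
            adj[w].add(v)
    # Rewiring: keep node v, replace the far end w by a random non-neighbour.
    for step in range(1, k // 2 + 1):
        for v in range(n):
            w = (v + step) % n
            if w in adj[v] and rng.random() < p:
                candidates = [u for u in range(n) if u != v and u not in adj[v]]
                if candidates:
                    new = rng.choice(candidates)
                    adj[v].discard(w)
                    adj[w].discard(v)
                    adj[v].add(new)
                    adj[new].add(v)
    return adj
```

With `p = 0` the lattice stays regular; small positive values of `p` introduce the occasional long-range connections described above while leaving the number of edges unchanged.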

Barabási and Albert (1999) introduced a theoretical model that generates graphs whose connectivity distribution decays as a power law. This feature was found to be a direct consequence of two generic mechanisms: (1) networks expand continuously by the addition of new vertices, and (2) these newly added vertices attach preferentially to sites that are already well connected (Barabási and Albert 1999). Since this feature is independent of the actual size of the network, this class of inhomogeneous networks was called scale-free networks. The topology of the World Wide Web was investigated by considering HTML documents as vertices connected by links pointing from one page to another (Albert, Jeong, and Barabási 1999; Barabási and Albert 1999; Barabási, Albert, and Jeong 2000). The World Wide Web, as well as the Internet, which emerges from connecting different servers, demonstrates scale-free properties. Both nets display a high degree of robustness against errors (Albert, Jeong, and Barabási 2000). However, these networks are highly vulnerable to perturbations of the highly connected nodes.
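The two mechanisms, growth and preferential attachment, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the published model code: the function name `preferential_attachment`, the parameter `m` (links added per new vertex), and the choice of a small complete graph as the seed network are all assumptions.

```python
import random

def preferential_attachment(n, m, seed=0):
    """Grow a graph to n vertices; each new vertex links to m distinct
    existing vertices chosen with probability proportional to degree."""
    rng = random.Random(seed)
    # Seed network (assumption): complete graph on m + 1 vertices.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Every vertex appears in `ends` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over vertices.
    ends = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(ends))
        for target in chosen:
            edges.append((new, target))
            ends.extend((new, target))
    return edges
```

Counting how often each vertex occurs in the returned edge list gives the degree sequence, whose distribution can then be inspected for a power-law tail.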

Biological Networks
Recently, scale-free and small-world behaviors have also been found in biological networks. Watts and Strogatz (1998) reported the architecture of the Caenorhabditis elegans nervous system to show significant small-world behavior. Fell and Wagner (2000) assembled a list of stoichiometric equations representing the central routes of energy metabolism and small-molecule building block synthesis in Escherichia coli. A substrate graph, defined by a vertex set consisting of all metabolites that occur in the network, was constructed. Two metabolites were considered to be linked if they occurred in the same reaction. Fell and Wagner (2000) found the substrate graph to be sparse, with glutamate, coenzyme A, 2-oxoglutarate, pyruvate, and glutamine having the highest degree of connectivity. This sample of metabolites might be viewed as a core of E. coli metabolism which was found without any subjective criteria.

Most recently, Jeong et al. (2000) comparatively analyzed metabolic networks of organisms representing all three domains of life. The metabolic network is represented by nodes, the substrates, connected by directed edges symbolizing the actual reactions. The topologies of these networks are best described by a scale-free model. Furthermore, the diameters of the nets remain the same for all of these networks regardless of the number of substrates found in the given species. Interestingly, the ranking of the most connected substrates is largely identical for all organisms, thus indicating hubs which dominate the topology of the nets. Like the technical networks, the E. coli network theoretically has high tolerance to random errors but severe sensitivity to the removal of the highly connected nodes.

Another biochemical network is formed by sets of domains which are linearly arranged in protein sequences. This might generate graphs comprising interesting features. Since the topology of graphs thus generated is still unknown, it is worth considering this way of treating domain architectures.

Domain Organization
Protein crystallography reveals that the fundamental unit of protein structure is the domain. Independent of neighboring sequences, this region of a polypeptide chain folds into a distinct structure and mediates biological functionality (Janin and Chothia 1985). Most proteins contain only a single domain (Doolittle 1995). Some sequences appear as multidomain proteins adopting different linear arrangements of their domain sets. On average, such domain architectures comprise two to three domains; however, some human proteins contain up to 130 domains (Li et al. 2001).

Similar to the discussion about the role of certain metabolites in the emergence of metabolism, there has been a debate about the actual number of existing domains and their origin. One view treats all past and present proteins as the result of shuffling of a large set of primordial polypeptides (Dorit and Gilbert 1991). These are assumed to result from splicing events involving exons separated by introns (Gilbert and Glynias 1993). The other view deals with the existence of a few small polypeptides in the early stages of life; these are the predecessors of most contemporary proteins (Doolittle 1995). Gene duplication and subsequent modification were employed to form the latter molecules from this small set of polypeptides. Independent of the timing for the introduction of introns, recombination in introns provides a mechanism for the exchange of exons between genes. This mechanism for the acquisition of new functions by eukaryotic genes is commonly known as "exon shuffling." It was assumed that primitive proteins were encoded by exons that were spliced together (Seidel, Pompliano, and Knowles 1992). However, such shuffling events take on biological significance only if the exons involved carry a functional or structural domain. Although many examples of exon shuffling have been found, no significant correspondence between exons and units of protein structure has been detected (Stoltzfus et al. 1994).

It is common to find that newly sequenced proteins are homologous to some other known proteins over parts of their lengths. Thus, most proteins may have descended from relatively few ancestral types. The sequences of large proteins often show signs of having evolved by the joining of preexisting domains in new combinations. Such a mechanism is commonly known as "domain shuffling" and appears as two types: domain duplication and domain insertion (Doolittle 1995). Domain duplication refers to the internal duplication of at least one domain in a gene. Domain insertion denotes the process by which structural or functional domains are exchanged between proteins or inserted into a protein. Shuffling of domains has more biological significance than exon shuffling because domains are real structural and functional units in proteins, while exons often are not.

Functional links between proteins have also been detected by analyzing the fusion patterns of protein domains. Two separate proteins A and B in one organism may be expressed as a fusion protein in other species. A protein sequence containing both A and B is termed a Rosetta Stone sequence. However, this framework applies only in a minority of cases (Marcotte et al. 1999).

Protein Domain Databases
Currently, there is a large variety of databases, each collecting protein domain information in a different way. The Prosite database (http://expasy.proteome.org.au/prosite/) consists of biologically significant motifs and profiles determined and formulated with appropriate computational tools. Uncharacterized proteins are assigned to certain protein families with the aid of weight matrices and profiles (Hofmann et al. 1999). The majority of Prosite documentation refers to motifs, thus providing combined motif and domain information. Release 16.0 of Prosite contains 1,374 different patterns, rules, and profiles.

Another database is Pfam (http://www.sanger.ac.uk/Software/Pfam/index.shtml), which is a large collection of multiple-sequence alignments of protein families and profile hidden Markov models (Bateman et al. 2000). Moreover, Pfam contains curated documentation for all 2,478 families in version 5.5, covering nearly 65% of SwissProt release 38 and SP-TrEMBL release 11.

Many more protein families are found, however, in ProDom (http://www.toulouse.inra.fr/prodom.html) (Corpet et al. 2000), which contains all protein domain families that can be generated automatically from the SwissProt and TrEMBL sequence databases (Bairoch and Apweiler 2000). Expert-validated families are extended by using Pfam seed alignments to build new ProDom families with the Psi-Blast database searching algorithm (Altschul et al. 1997). Other families are generated by recursive use of Psi-Blast. ProDom, version 99.2, has 157,648 domain families, covering almost 95% of SwissProt release 37 and TrEMBL release 10. ProDom thus offers higher coverage than Pfam. However, ProDom tends to overpredict the number of protein families, since some of its families can be recognized as subsets of larger families.

Finally, InterPro (http://www.ebi.ac.uk/interpro) (Apweiler et al. 2001a) is an integrated documentation resource of protein families, domains, and functional sites rationalizing the complementary efforts of the Prosite, Pfam, ProDom, and Prints (Attwood et al. 2000) database projects. InterPro contains manually curated documentation and diagnostic signatures from these databases and uses these to create a unique, nonredundant characterization of protein families, domains, and functional sites.

Proteome Databases
The advent of fully sequenced genomes of various organisms has facilitated the investigation of proteomes. The Proteome Analysis database (http://www.ebi.ac.uk/proteome) (Apweiler et al. 2001b) has been set up to provide comprehensive statistical and comparative analyses of the predicted proteomes of fully sequenced organisms. The analysis is compiled using mainly InterPro and CluSTr (Kriventseva et al. 2001) and is performed on the nonredundant complete proteome sets of SwissProt and TrEMBL entries. The latest release provides 41 nonredundant proteomes of genomes of archaea, bacteria, and eukaryotes.

Most recently, SwissProt and Ensembl have prepared a complete nonredundant human proteome set consisting of 30,585 sequences. The set consists of the combination of the SwissProt/TrEMBL nonredundant human proteome set (15,691 sequences) and additional nonredundant peptides predicted by Ensembl (14,894 sequences). Ensembl (http://www.ensembl.org) provides complete and consistent annotation across the human genome.

In this paper, domain networks generated with data from ProDom, Pfam, and Prosite domain databases will be presented. Furthermore, InterPro domain networks of different species that are generated with complete proteome sets provided by the Proteome Analysis database will be considered. Subsequently, the topology of these networks will be investigated, and biological and evolutionary consequences will be discussed.


    Materials and Methods
A domain graph $G_D = (V_D, E_D)$ is formally defined by a vertex set $V_D$ consisting of all domains found within proteins. Two domains are regarded as being adjacent if they occur together in one protein at least once. An undirected edge connecting these two vertices indicates this relationship. Such connections define the edge set $E_D$. In this graph, the degree $k$ of a vertex is the number of other vertices to which it is linked. The mean path length $L$ from a vertex to any other vertex of the graph is defined as the average of the path lengths to all other vertices. Another important quantity is the clustering coefficient $C(\upsilon)$ of a vertex $\upsilon$. It measures the fraction of the vertices connected to $\upsilon$ which are also connected to each other. By extension, the clustering coefficient $C$ of the graph is defined as the average of $C(\upsilon)$ over all $\upsilon$.
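The definitions above translate directly into a small data structure. The following sketch is illustrative only; the study itself used PAJEK. The function names are assumptions, the clustering coefficient is computed as the fraction of neighbor pairs that are themselves linked, and vertices with fewer than two neighbors are assigned a coefficient of zero by convention.

```python
from itertools import combinations

def build_domain_graph(protein_domains):
    """Map {protein id: iterable of domain ids} to an undirected domain
    graph stored as {domain: set of adjacent domains}."""
    adj = {}
    for domains in protein_domains.values():
        unique = set(domains)          # repeated domains count once
        for d in unique:
            adj.setdefault(d, set())   # single-domain proteins give isolated vertices
        for a, b in combinations(sorted(unique), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def degree(adj, v):
    """Number of other domains that v co-occurs with."""
    return len(adj[v])

def clustering(adj, v):
    """Fraction of pairs of neighbours of v that are linked to each other."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0                     # convention (assumption) for k < 2
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def mean_clustering(adj):
    """Clustering coefficient of the graph: average over all vertices."""
    return sum(clustering(adj, v) for v in adj) / len(adj)
```

For a toy input such as `{"P1": ["SH3", "SH2", "pkinase"], "P2": ["SH3", "PH"], "P3": ["ank"]}` (hypothetical proteins), `SH3` has degree 3 and `ank` remains an unconnected vertex.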

Growing amounts of empirical and theoretical data about the topologies of large complex networks indicate the emergence of several network types. Basically, these types are classified by the connectivity distribution $P(k)$ of nodes. Exponential networks are characterized by a $P(k)$ which peaks at an average $\langle k \rangle$ and decays exponentially. Prominent representatives of this type are the random graph model (Erdös and Rényi 1960) and the small-world model (Watts and Strogatz 1998). Both lead to fairly homogeneous networks with nodes comprising approximately the same number of links, $k \sim \langle k \rangle$ (Barabási, Albert, and Jeong 1999). Furthermore, a small-world graph adopts a sparse topology, $L \geq L_{random}$, but remains more highly clustered than an equally sparse random graph, $C \gg C_{random}$ (Watts and Strogatz 1998). By contrast, in the class of inhomogeneous networks called scale-free networks, the connectivity distribution decays as a power law, $P(k) \sim k^{-\gamma}$. The latter result indicates a network free of a characteristic scale. Compared with exponential networks, the probability that a node is highly connected ($k \gg \langle k \rangle$) is statistically significant in scale-free networks (Barabási and Albert 1999).

In this study, protein domain information was retrieved from the ProDom, Prosite, and Pfam databases. Sixty-five percent of all ProDom sequences correspond to families containing 10 or more members. In order to restrict the size of the network, the sample of ProDom domains focuses on these families. Thus, 5,995 ProDom domains were obtained. The Prosite database declares false-negative entries which were filtered out of the sample used for the network construction. Sequence entries of each database provide SwissProt annotation. Thus, every protein sequence was itemized with each domain that it contained. This was done for each database separately. Domains which were listed due to their occurrence in one protein sequence represent vertices which are connected to each other in the domain graphs.

Complete proteome data sets of different species were retrieved from the Proteome Analysis database, which uses InterPro annotation of protein domains. Such proteome data sets adopt SwissProt, TrEMBL, TrEMBLnew, and Ensembl annotation of proteins. Analogously, InterPro domains which appear along with other domains in a protein sequence represent vertices which are connected to each other in the domain graphs. The numbers of links to other domains in such graphs were logarithmically binned, and frequencies were thus obtained. Such pairs of values were subjected to a linear regression procedure.
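The binning and regression step can be sketched as follows. The source does not state the bin base or the normalization, so the choices here (bins that double in width, counts divided by bin width, ordinary least squares on base-10 logarithms) and the function names are assumptions made for illustration.

```python
import math

def log_binned_frequencies(degrees):
    """Bin positive integer degrees into [1,2), [2,4), [4,8), ... and return
    (geometric bin centre, frequency per unit degree) for non-empty bins."""
    counts = {}
    for k in degrees:
        if k < 1:
            continue                    # unlinked domains have no place on a log axis
        i = k.bit_length() - 1          # bin i covers degrees 2**i .. 2**(i+1) - 1
        counts[i] = counts.get(i, 0) + 1
    total = sum(counts.values())
    points = []
    for i in sorted(counts):
        width = 2 ** i                  # number of integer degrees in bin i
        centre = math.sqrt(2 ** i * 2 ** (i + 1))
        points.append((centre, counts[i] / (total * width)))
    return points

def fit_power_law(points):
    """Least-squares line through (log10 x, log10 y); returns (slope, intercept).
    For P(k) ~ k**(-gamma), the exponent gamma is estimated as -slope."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx
```

Dividing by the bin width keeps the wide bins at high degree from being overweighted; without it the fitted slope would be shallower by one.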

PAJEK (the Slovene word for spider), a program for large-network analysis and visualization, was used for the calculation of the latter values (Batagelj and Mrvar 1998). This program is available at http://vlado.fmf.uni-lj.si/pub/networks/pajek/.


    Results
The domain graphs are sparse, with small average degrees (table 1) compared with the maximal possible degree $k = n - 1$, where $n$ is the number of vertices. In this respect, the results of figure 1 are interesting. The vertices which denote Prosite domains were ranked by the frequency of their connectivity. The curve is similar to a generalized Zipf's law curve, in which the frequency of occurrence of some event $f(x)$ as a function of the rank $x$ is a power-law function $f(x) = a(b + x)^{-c}$, with the exponent $c$ close to unity. The plot of Prosite domains in figure 1 satisfies the latter condition, with $c = 0.89$. We are thus dealing with relatively few highly connected domains and many rarely connected ones. Essentially, the frequency distributions of ProDom and Pfam domains are similar. However, they fit the generalized Zipf's law less well. Distributions following Zipf's law have also been observed in the context of literary vocabulary (Miller and Newman 1958), frequency of secondary structures of RNA (Schuster et al. 1994), lattice proteins (Bornberg-Bauer 1997), and hits per web page on the World Wide Web (Huberman et al. 1998). This observation is in accordance with the picture of scale-free networks which are topologically dominated by a few highly connected hubs.
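To make the shape of the generalized Zipf curve concrete, the sketch below evaluates $f(x) = a(b + x)^{-c}$ with the best-fit parameters reported for Prosite in figure 1 as defaults. The function names are assumptions. The second function gives the local slope on a log-log plot, $d \ln f / d \ln x = -c\,x/(b + x)$, which follows from differentiating $\ln f = \ln a - c \ln(b + x)$.

```python
def generalized_zipf(rank, a=0.21, b=7.93, c=0.89):
    """Frequency predicted for a given rank by f(x) = a * (b + x) ** (-c)."""
    return a * (b + rank) ** (-c)

def log_log_slope(rank, b=7.93, c=0.89):
    """Local slope of ln f versus ln x; tends to -c once rank >> b."""
    return -c * rank / (b + rank)
```

The offset `b` flattens the curve at low ranks, so the pure power-law slope of `-c` is only approached for ranks well above `b`.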


Table 1 Some Basic Data for the ProDom, Prosite, and Pfam Graph

 


Fig. 1.—The frequency distribution of Prosite domain connectivity. The numbers of links to other domains are ranked by their frequencies, which follow a generalized Zipf's law: $f(x) = a(b + x)^{-c}$, with $x$ being the rank and $f(x)$ being its frequency. Parameter values of the best fit (dot-dashed curve) are $a = 0.21$, $b = 7.93$, and $c = 0.89$.

 
As illustrated in figure 2, frequency distributions of vertices with degree $k$ follow a distribution comparable to a power-law distribution. Although the shapes of the distribution curves are different, they share an area of linearity. Within these linear regions, the frequency distribution of links from ProDom domains follows $P(k) \approx k^{-\gamma}$ with $\gamma = 2.5$. By contrast, the distributions of degrees of Pfam and Prosite domains follow the same law with $\gamma = 1.7$. Although the curves do not follow exactly the curvature of the degree distribution proposed in the original scale-free model, one can observe a type of scale-free dependence even if the scale-free model is a rough approximation of the real situation. The topology of such domain graphs is better described by a highly heterogeneous scale-free or small-world model than by an exponential model.



Fig. 2.—The frequency distribution of domain connections within protein sequences. Domain data were obtained from the ProDom, Pfam, and Prosite protein databases

 
In table 2 it can be observed that the domain graphs partially satisfy the structural properties of small-world graphs. While the clustering coefficients $C_{\upsilon}$ of the domain graphs by far exceed the respective coefficients of corresponding random graphs, the characteristic path lengths $L_{\upsilon}$ do not meet the requirements of a small-world graph. In line with the observation that the vast majority of proteins contain only one domain (Marcotte et al. 1999), the domain networks contain a large number of unconnected vertices (see table 1). This feature of domain distribution among protein sequences explains in particular the large number of connected components in domain graphs. Although domain graphs are thus highly scattered, every graph contains a major subnet among its connected components which gathers the majority of domains. These major components feature $L_{\upsilon}$ and $C_{\upsilon}$ values that satisfy the demands of small-world graphs by exceeding the respective values of random graphs of equal size. Thus, this study focuses on the analysis of the major components exhibiting small-world and scale-free behavior. In order to clarify the graph topology, figure 3 displays the major component of the network which was generated from proteome data of Saccharomyces cerevisiae.
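Isolating the major component and measuring its characteristic path length can be sketched with breadth-first search. This is an illustration, not the PAJEK procedure used in the study. The function names are assumptions, and the random-graph reference values use the approximations $L_{random} \sim \ln n / \ln k$ and $C_{random} \sim k/n$ given by Watts and Strogatz (1998), which are not restated in this paper.

```python
import math
from collections import deque

def connected_components(adj):
    """Return the connected components of {vertex: set of neighbours},
    largest first; the first entry is the major component."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return sorted(components, key=len, reverse=True)

def characteristic_path_length(adj, nodes):
    """Mean shortest-path length over all vertex pairs of one component."""
    total, pairs = 0, 0
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w in nodes and w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def random_graph_reference(n, mean_degree):
    """Approximate (L_random, C_random) for an equally sparse random graph."""
    return math.log(n) / math.log(mean_degree), mean_degree / n
```

Restricting the path-length average to one component sidesteps the undefined distances between unconnected vertices, which is why the whole scattered graph cannot be compared with a random graph directly.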


Table 2 Characteristic Path Length L and Clustering Coefficient C of ProDom, Pfam, and Prosite Domain Nets

 


Fig. 3.—Major component of the domain network of Saccharomyces cerevisiae, comprising 204 vertices and 347 edges

 
The investigations carried out so far consider all domains without taking into account their origin. Presumably, the degree of connectivity is different if one focuses on different species. All domain connections of six species which developed differently in the course of evolution were extracted from the complete proteome sets provided by the Proteome Analysis database. As illustrated in figures 4 and 5, the frequency distributions of links for humans, C. elegans, Drosophila, yeast, E. coli, and Methanococcus still follow the expected power law. However, the slopes of the lines are slightly different. Interestingly, the slopes of humans and Drosophila nearly coincide in figure 4. Moreover, the regression lines show almost the same intercept as that of C. elegans. In figure 5, the situation changes slightly. While the slopes are significantly steeper than that of humans, the regression lines of yeast, E. coli, and Methanococcus run nearly parallel. Thus, it is tempting to assume a trend which guides multicellular organisms to higher domain connectivity.



Fig. 4.—The frequency distribution of domain connections within protein sequences of Caenorhabditis elegans, Drosophila, and humans. The domain data were obtained from the Proteome Analysis database. The numbers of links to other domains were logarithmically binned, and frequencies were thus obtained. These pairs of values were subject to a linear regression procedure. Regression lines of Drosophila and human coincide

 


Fig. 5.—The frequency distribution of domain connections within protein sequences of Methanococcus, Escherichia coli, yeast, and humans. The domain data were obtained from the Proteome Analysis database. The numbers of links to other domains were logarithmically binned, and frequencies were thus obtained. These pairs of values were subject to a linear regression procedure

 
Interestingly, the majority of highly connected InterPro domains appear in signaling pathways, as the list of the 10 best linked domains of different species in table 3 reveals. Obviously, the evolutionary trend toward compartmentalization of the cell and multicellularity demands a higher degree of organization. Therefore, more emphasis is put on the maintenance of inter- and intracellular signaling channels, cell-cell contacts, and integrity. Hence, proteomes have to provide protein sets which cover such cellular demands. The growing number of highly linked domains of signaling and extracellular proteins seen in comparisons of archaea, prokaryotes, and eukaryotes confirms this assumption.


Table 3 The Ten Most Highly Connected InterPro Domains of Methanococcus, Escherichia coli, Yeast, Caenorhabditis elegans, Drosophila, and Humans

 

    Discussion
What might be the functional, phylogenetic, or bioinformatic implications of the power-law distribution of the connectivity of domains and the small-world behavior of the domain networks studied?

Completeness and Quality of Data
Regardless of whether Pfam, Prosite, or ProDom domain information is used, the qualitative topology of domain networks remains unchanged. Since these databases differ significantly in size and methodology, it is tempting to argue that even though the current domain data are far from complete, the topology of domain networks will not change significantly with the growing amount of domain data. This assumption is supported by the characteristics of scale-free networks, whose topology is independent of the actual size of the underlying network. Hence, the major observation that the topology of domain graphs is dominated by few highly linked domains is unlikely to change fundamentally with the incorporation of new protein domain data. InterPro gathers and streamlines mostly distinct domain information from the above-mentioned domain databases, providing a centralized annotation resource to reduce the amount of duplication between the database resources. Hence, scale-free characteristics of InterPro domain networks which were generated with the aid of complete proteomes of different species do not change significantly in comparison to networks generated with domain information from a single database. However, it should be noted that the acquisition of protein domain information is biased to a certain extent, since eukaryotic and mammalian proteins are on average far better studied and documented in databases than archaeal or prokaryotic proteins.

Another important consideration concerns the acquisition of proteome information. Proteome data which were entirely extracted by genome translation might not sufficiently explain the setup of all cellular processes. Domain networks were generated with the aid of translated genome databases which did not cover effects such as alternative splicing and domain usage. Alternative pre-mRNA splicing is an important mechanism for regulating gene expression in higher eukaryotes (Smith, Patton, and Nadal-Ginard 1989). By recent estimates, the primary transcripts of ~30% of human genes are subject to alternative splicing. Thus, the connectivity of domains found in higher eukaryotes might be significantly higher than it is "in silico."

In addition, the differences in frequency distributions between higher eukaryotes, bacteria, and archaea in figures 4 and 5 might also be related to the numbers of domain architectures that were found in the different organisms. Since eukaryotes and mammals developed many more distinct domain architectures (International Human Genome Sequencing Consortium 2001), the respective distributions of domain connections are statistically more reliable than those of prokaryotes and archaea. Therefore, future studies should clarify whether the small number of domain architectures leads to slight artifacts in the slopes of prokaryotic and archaeal organisms.

Evolutionary Aspects
Are the observed topologies a direct consequence of domain evolution? The model of Barabási and Albert (1999) generates scale-free networks by preferential attachment of newly added vertices to already well connected ones. Consequently, Fell and Wagner (2000) argued that vertices with many connections in a metabolic network were metabolites originating very early in the course of evolution and shaping a core metabolism. Analogously, highly connected domains could also have originated very early. If one compares the lists of the most highly linked domains in table 3, this assumption does not hold. The majority of the more highly linked domains in Methanococcus and E. coli are mainly concerned with the maintenance of metabolism. Nearly none of the domains that are highly linked in the higher organisms can be found among those of Methanococcus and E. coli, and vice versa; in the higher organisms, the focus of domain connection shifts to domain hubs involved in signal transduction, transcription, and cell-cell interactions. In addition, helicase C has roughly similar degrees of connection in all organisms. However, the ankyrin repeat motif (ank) is one of the few domains which can be found to be unlinked in E. coli, whereas it possesses a growing degree of connectivity in higher eukaryotes.

Apparently, the majority of highly connected domains seem to have arisen late in eukaryotes of larger proteome size. The evolutionary trend toward multicellularity requires proteomes which feature new and additional complex cellular processes like signal transduction or cell-cell contacts. One way of meeting growing demands is the expansion of already-existing protein sets. Indeed, many protein families are expanded in humans relative to Drosophila and C. elegans. These are mainly involved in inter- and intracellular signaling pathways, apoptosis (Aravind, Dixit, and Koonin 2001), development, and immune and neural functions (International Human Genome Sequencing Consortium 2001; Venter et al. 2001). Although many protein families of these organisms exhibit great disparities in abundance, C2H2-type zinc finger motifs and eukaryotic protein kinase (pkinase) are among the top 10 most frequent domain families (Rubin et al. 2000; Tupler, Perini, and Green 2001) and the best-connected domains in table 3. At least in higher eukaryotes, both domains tend to increase their connections to other domains in a way similar to that of the already-mentioned ankyrin repeat motif (ank).

Although human phenotypic complexity by far exceeds that of Drosophila and C. elegans, the size of the human proteome remains comparatively small. Thus, combinatorial aspects of domain arrangements might have a major impact on the preservation of cellular processes. Among chromatin-associated proteins, transcription factors, and especially apoptosis proteins, a significant portion of protein architecture is shared between humans and Drosophila. However, substantial innovation in the creation of new protein architectures was clearly detectable (International Human Genome Sequencing Consortium 2001). Apparently, expansion of particular domain families and the accompanying evolution of complex domain architectures from presumably preexisting domains coincide with the increase of the organism's complexity. In this regard, the different slopes in figures 4 and 5 indicate this evolutionary trend to higher connectivity of domains (e.g., pkinase, SH3, and EGF in table 3), as well as a growing complexity in the arrangement of domains within proteins. In comparison to noneukaryotes, Drosophila developed more complex domain architectures. Thus, the frequency distributions of the latter organisms can be clearly separated in figure 5, where lower complexity in domain architecture is indicated by steeper slopes. The first point is well reflected by the slightly different slopes of humans, Drosophila, and C. elegans in figure 4.

In conclusion, a variety of arguments point to an increase in the complexity of the proteome from the single-celled yeast to multicellular vertebrates such as humans. Essentially, the expansion of protein families coincides with an increase in the connectivity of the respective domains. Extensive shuffling of domains to increase combinatorial diversity might provide protein sets that are sufficient to preserve cellular processes without dramatically expanding the absolute size of the protein complement. Hence, the relatively greater proteome complexity of higher eukaryotes, and especially humans, cannot be simply a consequence of genome size but, to a certain extent, must also be a consequence of innovations in domain arrangements. Thus, highly linked domains represent functional centers in various cellular processes. They could be treated as evolutionary hubs that help to organize the domain space through occasional links to numerous other functionally related domains.
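The notion of a highly linked domain can be illustrated with a minimal sketch. It assumes that two domains count as linked when they occur together in at least one protein; the exact linkage rule for each database is the one given in Materials and Methods. The function names and the toy architectures in the usage example are hypothetical and are not taken from any proteome.

```python
from itertools import combinations


def domain_graph(architectures):
    """Build an undirected graph as a dict: domain -> set of linked domains.

    Each architecture is the list of domain names found in one protein.
    Repeats of a domain within a protein do not create a self-link.
    """
    neighbors = {}
    for domains in architectures:
        unique = set(domains)
        for d in unique:
            neighbors.setdefault(d, set())
        for a, b in combinations(sorted(unique), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def rank_by_connectivity(neighbors):
    """Return (domain, number of links) pairs, best-connected first."""
    pairs = ((d, len(linked)) for d, linked in neighbors.items())
    return sorted(pairs, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Invented architectures, for illustration only.
    toy_proteins = [
        ["pkinase", "SH3", "SH2"],
        ["pkinase", "ank"],
        ["SH3", "EGF"],
        ["pkinase", "EGF"],
    ]
    for domain, links in rank_by_connectivity(domain_graph(toy_proteins)):
        print(domain, links)
```

Running the same ranking on the architectures of different organisms would show whether a given domain gains links as proteome complexity grows.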

Quality of the Basic Models
The view that new protein architectures can be created by shuffling, adding, and deleting domains, resulting in new proteins from old parts, is well reflected by the emergence of such domain hubs. However, a variety of domain arrangements contradict the ideal image of a continuous addition of new domain links to already-existing hubs, as assumed in scale-free networks. For example, the S1 RNA-binding domain is linked to helicase C in E. coli, whereas in humans it is connected to RNB, the KH domain, and RNAse PH. Neither the procedure of generating a small-world graph in the original sense nor the scale-free model provides for the deletion of vertices. However, the assumption that domains occasionally emerge and disappear is a basic premise of protein evolution. Thus, scale-free and small-world models can only be a rough approximation of the real situation.
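The limitation can be seen directly in a sketch of growth by preferential attachment, the construction behind the scale-free model of Barabási and Albert (1999). This is a simplified illustration and not the procedure used in this work; the function name and the parameters n_vertices, m, and seed are hypothetical. Each step adds one vertex and m links, and nothing in the loop ever removes a vertex or a link.

```python
import random


def preferential_attachment(n_vertices, m=1, seed=0):
    """Grow a graph by preferential attachment; return its list of edges.

    Starts from m + 1 fully connected vertices. Every new vertex attaches
    to m distinct existing vertices, each chosen with probability
    proportional to its current number of links.
    """
    rng = random.Random(seed)
    edges = []
    endpoints = []  # a vertex appears here once for every link it has
    for i in range(m + 1):
        for j in range(i):
            edges.append((j, i))
            endpoints += [j, i]
    for new in range(m + 1, n_vertices):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for old in chosen:
            edges.append((old, new))
            endpoints += [old, new]
    return edges


if __name__ == "__main__":
    edges = preferential_attachment(1000, m=2, seed=1)
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Early vertices tend to accumulate the most links.
    print(sorted(degree.values(), reverse=True)[:5])
```

Representing the loss of a domain, or of a link such as the one between S1 and helicase C, would require an additional removal step that this growth rule does not contain.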


    Acknowledgements
 
The inspiration for this work came from Ursula Kummer, who was always ready to discuss this work, as well as Erich Bornberg-Bauer, Isabel Rojas, Carel van Gend, Martin Vingron, and Rebecca Wade. Ioannis Xenarios gave me numerous valuable hints. I was supported by Andrej Mrvar and Vladimir Batagelj during the work with their PAJEK program. The Klaus Tschira-Foundation (KTF) is gratefully acknowledged for funding this project.


    Footnotes
 
William Taylor, Reviewing Editor

1 Keywords: protein domains, scale-free and small-world topology, evolution of protein architectures

2 Address for correspondence and reprints: Stefan Wuchty, European Media Laboratory, Schloß-Wolfsbrunnenweg 33, D-69118 Heidelberg, Germany. stefan.wuchty@eml.villa-bosch.de


    References
 

    Albert R., H. Jeong, A. Barabási. 1999. Diameter of the World Wide Web. Nature 401:130-131.

    ———. 2000. Error and attack tolerance of complex networks. Nature 406:378-382.

    Altschul S., T. Madden, A. Schaeffer, J. Zhang, Z. Zhang, W. Miller, D. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

    Apweiler R., T. Attwood, A. Bairoch, et al. (26 co-authors). 2001a. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29:37-40.

    Apweiler R., M. Biswas, W. Fleischmann, et al. (11 co-authors). 2001b. Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucleic Acids Res. 29:44-48.

    Aravind L., V. Dixit, E. Koonin. 2001. Apoptotic molecular machinery: vastly increased complexity in vertebrates revealed by genome comparisons. Science 291:1279-1284.

    Attwood T., M. Croning, D. Flower, A. Lewis, J. Mabey, P. Scordis, J. Selley, W. Wright. 2000. PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res. 28:225-227.

    Bairoch A., R. Apweiler. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28:45-48.

    Barabási A., R. Albert. 1999. Emergence of scaling in random networks. Science 286:509-512.

    Barabási A., R. Albert, H. Jeong. 1999. Mean-field theory for scale-free random networks. Physica A 272:173-187.

    ———. 2000. Scale-free characteristics of random networks: the topology of the World-Wide Web. Physica A 281:69-77.

    Barthélémy M., L. Amaral. 1999. Small-world networks: evidence for a crossover picture. Phys. Rev. Lett. 82:3180-3183.

    Batagelj V., A. Mrvar. 1998. PAJEK—program for large network analysis. Connections 21:47-57.

    Bateman A., E. Birney, R. Durbin, S. Eddy, K. Howe, E. Sonnhammer. 2000. The Pfam protein families database. Nucleic Acids Res. 28:263-266.

    Bornberg-Bauer E. 1997. How are model protein structures distributed in sequence space? Biophys. J. 5:2393-2403.

    Corpet F., F. Servant, J. Gouzy, D. Kahn. 2000. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28:267-269.

    Doolittle R. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64:287-314.

    Dorit R., W. Gilbert. 1991. The limited universe of exons. Curr. Opin. Genet. Dev. 1:464-469.

    Erdös P., A. Rényi. 1960. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 5:17-61.

    Fell D., A. Wagner. 2000. The small world of metabolism. Nat. Biotechnol. 18:1121-1122.

    Gilbert W., M. Glynias. 1993. On the ancient nature of introns. Gene 135:137-144.

    Guare J. 1990. Six degrees of separation: a play. Vintage Books, New York.

    Hofmann K., P. Bucher, L. Falquet, A. Bairoch. 1999. The PROSITE database, its status in 1999. Nucleic Acids Res. 27:215-219.

    Huberman B., P. Pirolli, J. Pitkow, R. Lukose. 1998. Strong regularities in World Wide Web surfing. Science 280:95-97.

    International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.

    Janin J., C. Chothia. 1985. Domains in proteins: definitions, location, and structural principles. Methods Enzymol. 115:420-430.

    Jeong H., B. Tombor, R. Albert, Z. Oltvai, A.-L. Barabási. 2000. The large-scale organization of metabolic networks. Nature 407:651-654.

    Kriventseva E., W. Fleischmann, E. Zdobnov, R. Apweiler. 2001. CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res. 29:33-36.

    Li W.-H., Z. Gu, H. Wang, A. Nekrutenko. 2001. Evolutionary analyses of the human genome. Nature 409:847-849.

    Marcotte E., M. Pellegrini, H.-L. Ng, D. Rice, T. Yeates, D. Eisenberg. 1999. Detecting protein function and protein-protein interactions from genome sequences. Science 285:751-753.

    Milgram S. 1967. The small-world problem. Psychol. Today 2:60-67.

    Miller G., E. Newman. 1958. Tests of a statistical explanation of the rank-frequency relation for words in written English. Am. J. Psychol. 71:209-218.

    Rubin G., M. Yandell, J. Wortmann, et al. (52 co-authors). 2000. Comparative genomics of the eukaryotes. Science 287:2204-2215.

    Schuster P., W. Fontana, P. Stadler, I. Hofacker. 1994. From sequences to shapes and back: a case study in RNA secondary structures. Proc. R. Soc. Lond. B Biol. Sci. 255:279-284.

    Seidel H., D. Pompliano, J. Knowles. 1992. Exons as microgenes. Science 257:1489-1490.

    Smith C., J. Patton, B. Nadal-Ginard. 1989. Alternative splicing in the control of gene expression. Annu. Rev. Genet. 23:527-577.

    Stoltzfus A., D. Spencer, M. Zuker, J. Logsdon Jr., W. Doolittle. 1994. Testing the exon theory of genes: the evidence from protein structure. Science 265:202-207.

    Tupler R., G. Perini, M. Green. 2001. Expressing the human genome. Nature 409:832-833.

    Venter J., M. Adams, E. Myers, et al. (271 co-authors). 2001. The sequence of the human genome. Science 291:1304-1351.

    Watts D., S. Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature 393:440-442.

Accepted for publication May 14, 2001.