From the McGill Centre for Bioinformatics, 3775 University Street, Montreal H3A 2B4 and the
Department of Biochemistry, 3655 Promenade Sir William Osler, McGill University, Montreal H3G 1Y6, Canada
![]() |
ABSTRACT |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
A common technique to find such regulatory relationships is to examine the distinguished set of genes relative to a so-called interaction network (2). High throughput methods, such as ChIP1-Chip, combined with computational approaches have generated transcriptional regulatory networks (3, 4). Proteomic approaches, such as tandem affinity purification tagging (5) and yeast two-hybrid assays, have yielded large scale protein-protein interaction maps (6, 7). Interaction networks are formed from the union of these two data types. Whereas the challenge of proteomics is to generate complete interaction networks for many different organisms and to compile this information in data bases, the challenge of bioinformatics is to automate strategies for extracting meaningful information from these massive graphs. Due to the size and complexity of these networks (30,000 nodes for human), a manual search for regulatory subgraphs is nearly impossible for even a handful of distinguished genes. Several related approaches exist in the literature for investigating interaction networks.
Oyama et al. (8) present a feature selection strategy. Given a protein-protein interaction network and various types of information such as enzyme functionality (EC number), functional annotations (Swiss-Prot keywords), PROSITE motifs, and conserved sequences, they attempt to automatically learn association rules from the data to more accurately characterize protein-protein interactions. A second approach from Lappe et al. (9) uses hierarchically organized functional annotations of proteins to contract or "distill" a protein-protein interaction graph into a more manageable but equally informative representation. A third approach from Kelley et al. (10) entitled PathBLAST allows investigators to compare interaction networks across species to identify pathways and complexes that have been conserved. The framework takes as input a protein "interaction path" and searches for high scoring alignments between pairs of protein interaction paths for which proteins of the first path are paired with putative orthologs occurring in the same order in the interaction network.
A fourth strategy from Ideker et al. (2) essentially maps mRNA expression data on to an interaction network. The weight of a node is a measure of significance of differential expression over one or more microarray hybridizations. The goal is to search for subnetworks that display a statistically significant amount of differential expression. Such subnetworks might correspond to small sets of proteins belonging to a common metabolic pathway for instance. The underlying computational problem is to find a maximum-weight connected component of the network (a set of proteins in the network in which there is a path between any two members). Although a solution may contain nodes that do not exhibit differential expression, such nodes must lie on a path between nodes that do exhibit differential expression. The non-differentially expressed proteins might then represent important "lost connections" between sets of differentially expressed proteins. In Ideker et al. (2) the authors show that these subnetworks succeed in finding proteins that are involved in common biological processes (e.g. the same metabolic pathway).
This study presents a new approach using so-called Steiner trees (see "Experimental Procedures") for investigating interaction networks. The goal is to find subnetworks within an interaction network that are plausibly responsible for regulating a given distinguished set of genes/proteins. Whereas the Ideker et al. (2) approach is a global search of the interaction network for differentially expressed subnetworks, our approach allows the user to specify a "distinguished" set of nodes. This distinguished set might, for example, represent a list of differentially expressed genes from a microarray experiment, chemical-genetic/synthetic-lethal interactions (11), or any other type of assay that produces lists of interesting genes/proteins and where similar questions regarding regulatory relationships are important. The Steiner tree solution provides a "backbone" consisting of genes that are the most likely candidates to be involved in regulatory relationships between members of the distinguished set. Additional nodes and edges are added to this backbone to include as many potential regulatory genes as possible while excluding most irrelevant interactions. Although it is likely impossible to infer a complete causal explanation of how/why a gene is important from an interaction network, the approach does produce subnetworks that are typically of a size that allows investigators to perform a more detailed, literature-based analysis. This data reduction technique is analogous to a BLAST search for a given uncharacterized sequence where the goal is to find a relatively small set of likely homologues within a large data base of sequences.
To investigate the utility of our approach, we performed three sets of experiments using an interaction network for yeast composed of 5,458 proteins and 23,642 interactions from the BIND version 2 (12) (yeast protein-protein interactions), TRANSFAC (13), SCPD (34) (yeast protein-DNA interactions), and yeast ChIP-Chip (protein-DNA) (3) data sets. First, to test the ability of our framework to find plausible regulatory subnetworks, we conducted experiments on the GAL80 deletion expression data set (14). Our results support the claim in Jansen et al. (15) that examining co-expression alone may not always be sufficient for finding regulatory subnetworks and allowed for the discovery of a previously unknown connection between GAL80 and the diauxic shift in yeast. Second, we analyzed the advantages of different interaction networks using a heat shock microarray expression data set (16). This allowed for the identification of a heat shock transcriptional response subnetwork that integrates known but dispersed observations and offers new insights into this process. Third, we present evidence in support of a claim of Ihmels et al. (17) that approximately half of the enzymes in a metabolic pathway are transcriptionally co-regulated and show that our framework accurately rediscovers known regulatory elements of the gluconeogenesis, galactose, glycolysis, and glucose fermentation pathways in yeast.
![]() |
EXPERIMENTAL PROCEDURES |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
Steiner Trees
Suppose we have an interaction network and a distinguished set of nodes S, a set of proteins in the interaction graph whose regulation (and possibly co-regulation) we wish to understand. A natural question is to ask how to connect the nodes of S together. Ideally there would be short paths between any two nodes in S. Moreover we would like to "highlight" nodes that are used by many paths between pairs of elements in S. The notion from mathematics that captures these criteria is the Steiner tree. A Steiner tree is a connected subgraph of the interaction network that includes all the distinguished nodes. That is, for any pair of nodes in the distinguished set, there is a path between the nodes in the Steiner tree. Fig. 1b depicts a Steiner tree for the distinguished set of vertices (dark shaded) given in Fig. 1a. The nodes in the tree that are not in S are called Steiner points. Note that nodes such as HAP4, PUT3, and CAT8 lie on the path between many pairs of proteins.
Finding a Steiner tree is trivial. However, what one seeks is a Steiner tree that is optimal in some sense. In the "classical" Steiner tree problem, each edge e is assigned a positive real number w(e) called the weight or cost of the edge. The cost of a Steiner tree T is simply the sum of the weight of each edge in the tree. The problem is to find a Steiner tree with minimal cost. To address the problem of finding regulatory subnetworks, we consider a modified version of this problem called the node-weighted Steiner tree problem. Rather than weighting edges, each node v is assigned cost w(v). This cost typically represents a measure of differential expression associated with v. The problem is to find a Steiner tree T for which the sum of node costs is minimal. Intuitively we are asking to connect the distinguished vertices in the "most compact" way possible. Each edge in the resultant network could then be examined for a regulatory role with respect to the elements of the distinguished set.
Both of these problems have been studied extensively in the computer science literature (1821), and there is likely no efficient algorithm for solving these problems optimally. However, there is an algorithm for the node-weighted Steiner tree problem that guarantees a solution no more than 2log|S| times optimal (20). This means that although we cannot be guaranteed that the algorithm will find the best solution, it will most often produce a sufficiently good solution. A contribution of this study is the observation that it is feasible to solve the node Steiner tree problem if the distinguished set S is small. By modifying the Dreyfus-Wagner algorithm (18) we can solve the node Steiner problem in time on the order of 3k 1·n + n3 time where k is the size of the distinguished set and n is the number of nodes in the interaction network. In practice, k tends to be as small as 10, but n is over 5,000. When k is too large for exact solution to be feasible, we default to the approximation algorithm from Klein et al. (20). In our experiments, this has always given solutions that are optimal or very close to optimal. Moreover there are many small simplifications one can automate that speed up the computation further. Succinctly our software is fast and gives accurate solutions; it is capable of handling the size of distinguished sets typically created by microarray, chemical-genetic, or synthetic-lethal assays.
Weight Functions
Our framework considers two types of weight functions denoted wd and w1. The weight function wd measures the amount of differential expression over a set of m microarray hybridization experiments. Any package for deciding differential expression can be used as long as it returns a p value, pu, as the measure of differential expression for gene u. We assign wd(u) = log(1 pu). Further background corrections can be applied to improve wd, but we omit this discussion here. Intuitively this cost function corresponds to choosing those nodes in the network that connect the distinguished set and show the most differential expression. The weight function w1 simply assigns the weight of every node to be 1. This cost function corresponds to finding the shortest way to connect the nodes in the distinguished set. Intuitively wd corresponds to finding the most likely regulatory explanation, whereas w1 corresponds to the most parsimonious explanation.
Steiner Trees as Backbones to Be Augmented
It is not likely that the Steiner tree alone constitutes a regulatory subnetwork for the distinguished set. Instead it is hoped that the Steiner tree prunes down the search for regulatory networks to a small, manageable number of alternative nodes. In this sense, the Steiner tree is a backbone that we can augment with additional nodes that are also likely to be relevant. Our software is implemented in the VisANT interactive visualization software (22). Beyond the computation of the Steiner tree, we offer several routines that search the neighborhood of the Steiner tree in a rational manner.
The general flavor of how Steiner backbones are extended is captured by the notion of an (i,l)-augmented graph. The intuition is as follows. If a node is within distance i of the Steiner tree (the shortest path to the Steiner tree has length at most i) and if there are at least l different such short paths, then this node is included in the subnetwork. Fig. 3 depicts the (1,2)-augmented tree from Fig. 1. Additional rules for augmenting backbones use functional annotations (including Gene Ontology annotations) to help decide the relevance of the gene/protein.
|
![]() |
RESULTS |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
We investigated the connectivity between GAL80 and a set of genes S that represents all genes whose average log ratio of expression is greater than +2.0 in these conditions. To test the ability of our framework to find plausible regulatory subnetworks, two new experiments were conducted. In the first experiment, we weighted the nodes of the interaction graph using the function wd (nodes are weighted as a function of the p value of their differential expression, see "Experimental Procedures"). If the proteins that play a regulatory role "between" GAL80 and the other members of S show sufficient co-expression, then intuitively the algorithm should find Steiner trees containing plausible regulatory proteins. In the second experiment, we used function w1 (all nodes are equally weighted regardless of their expression ratio, see "Experimental Procedures"). If it is not the case that there is sufficient differential expression among the regulatory proteins, then intuitively it may be more effective to simply look for the smallest Steiner trees connecting the elements of S.
Table I depicts the lengths of paths between GAL80 and each remaining element of S in an optimal Steiner tree using either w1 or wd. The columns labeled |V| contain the number of nodes between GAL80 and elements of S. The first observation is that the average length of such paths is 10.7 when the wd function is used but only 5.3 when w1 is chosen. We take a 2-fold increase to be significant. Additionally we note that the algorithm includes many more edges from the BIND data base when using wd (65% of the edges) as compared with w1 (
25% of the edges). It tends to be very difficult to construct plausible arguments establishing that the paths found under wd are biologically interesting. BIND contains many interactions from high throughput projects, and there have been several studies suggesting that more than 40% of the interactions in these data sets are not physiologically relevant (6). Although it would be very interesting to find paths that use one or two protein-protein interactions (and thus data from BIND should be included), we are less confident that larger linear chains of protein-protein interactions reflect biological reality; larger chains increase the probability that a false positive interaction pairing has been included. Conversely it is interesting to note that the inclusion of ChIP-Chip data with a conservative p value threshold of 0.001 does not seem to cause the same types of problems. Many of the edges in the networks we found that originate from this data set seem plausible or are at least good targets for further study.
|
Fig. 1 depicts an optimal Steiner tree for S using the w1 function. (There can be more than one Steiner tree with the same cost. We depict one arbitrarily chosen tree from the small set of optimal trees.) Note that all paths between elements of S and GAL80 "pass through" GAL4. This makes some intuitive sense: because GAL80 is known to inhibit GAL4 activity in the absence of galactose, the deletion of GAL80 allows unfettered GAL4 function. It can be hypothesized that GAL4 in turn is somehow responsible for the up-regulation of the remaining elements in S.
The Steiner tree of Fig. 1 contains several known interaction subnetworks. For example, the subnetwork formed by CAT8 and its interactors corresponds to the well studied diauxic shift response of yeast. This response is known to require CAT8, which up-regulates genes involved in gluconeogenesis (such as PCK1 and FBP1) and the glyoxylate cycle (ICL1) as well as in the production of acetyl-CoA from acetate (ACS1), the import of succinate into mitochondria (SFC1), and the import of lactate in the cell (JEN1) (24). Three of these diauxic shift CAT8-activated genes are part of the distinguished set S (ICL1, JEN1, and ACS1), and three additional genes (FBP1, PCK1, and SFC1) are found to be highly overexpressed in the conditions under study. These latter three genes have in fact log expression ratios between 1.7 and 1.9, marginally below our 2.0 threshold for differential expression (depicted as shaded nodes in Fig. 1).
The HAP activator complex is also known to increase the expression of its target genes, which are mostly involved in the tricarboxylic acid cycle and the respiratory chain during the diauxic shift. HAP4, the subunit that provides the principal activation function of the HAP complex, is a Steiner point in the solution for the distinguished set. It allows for the connection of two highly overexpressed genes PHO89 and PUT1. These are likely both required during a diauxic shift (to ensure the appropriate concentration of inorganic phosphate and nitrogen, respectively, in the cell during this cellular response (25)). Taken together, the solution found by the Steiner program suggests that, under these particular growth conditions, the yeast underwent a diauxic shift. This is remarkable because the yeast were exponentially growing in a 2% raffinose solution (14). Under such conditions, glycolysis is the major energy-providing pathway. It has been shown that activation of diauxic shift genes in wild-type yeast occurs well after the initial exponential growth phase (26). Because our solution is based upon GAL80 deletion expression data, this suggests that GAL80 may play a role in controlling the shift from glucose to ethanol metabolism.
Alternative and Augmented Backbones for S
We also experimented with backbones created by the unions of the shortest path from GAL80 to each member of the distinguished set (see "Other Strategies for Discovering Regulatory Subnetworks" under "Experimental Procedures"). Fig. 2 depicts the results. In this analysis, it is interesting to note that apart from GAL4 no other node is common to more than one path between GAL80 and elements of the distinguished set S. In several cases, this alternative strategy provided very little insight into the regulatory relationships between the elements of S. (This approach, which essentially studies each element of S individually with respect to GAL80, is more similar to traditional non-automated analyses one would perform using only the literature.) As such, it renders difficult the task of explaining why the elements of S are simultaneously highly up-regulated under the conditions studied. For example, the backbone described in Fig. 2 does not allow us to conclude that yeast is undergoing a diauxic shift under these conditions. This illustrates an important advantage of the full Steiner analysis when querying relationships among a group of proteins.
|
Transcriptional Regulatory Networks Versus Interaction Networks
We used a heat shock microarray data set available from Gene Expression Omnibus (platform GPL51, series GSE18 (16)) to investigate subnetworks found by the Steiner approach when only protein-DNA or both protein-DNA and protein-protein interactions are used. This data set is comprised of samples collected 20 min after yeast cells are exposed to a temperature shift from either 17, 21, 25, 29, or 33 to 37 °C. The goal is to identify plausible regulatory subnetworks that connect the most highly overexpressed genes in these experiments. Results shown here use the data from the 29 to 37 °C temperature shift experiment (other temperature shift experiments generate similar subnetworks, data not shown). The distinguished set is composed of the 12 most highly overexpressed genes between these conditions (Table II). Two types of experiments were conducted to study the usefulness of using different underlying networks.
|
|
When both protein-DNA and protein-protein interactions are used in the underlying interaction network, the Steiner tree algorithm provides solutions containing elements of transcriptional regulation as well as direct protein-protein regulation. Fig. 5 shows such a subnetwork for the heat shock-distinguished set of genes. In this subnetwork, the only transcription factor present, HSF1, influences the transcription of four genes that encode the proteins SSA4, TSL1, HSP104, and CPR1. HSP104 and CPR1 are involved in protein-protein interactions with other elements of the subnetwork, including LSP1, which is a central Steiner point in this subnetwork.
|
It should be noted that the subnetworks in Figs. 4 and 5 provide useful and different information. Although the transcriptional network in Fig. 4 is easier to interpret and readily suggests a sequence of events (or possibly "causation") following heat shock treatment, Fig. 5 does connect more elements of the distinguished set (because the interaction network is much larger when protein-protein interactions are included) and provides some insight as to which protein complexes are involved in the response.
Finding Regulators of Pathways
Ihmels et al. (17) estimate that approximately half of the metabolic proteins in any pathway exhibit co-expression. Our methodology allows us to further evaluate this claim and to identify regulators of these metabolic proteins in the interaction network. We use the 20 microarray results from Ideker et al. (14) containing wild-type and single deletions of GAL genes, 10 metabolic pathways from Saccharomyces Genome Database (25), and an interaction network consisting of data from only TRANSFAC, ChIP-Chip, and SCPD (BIND is excluded from this analysis because we want to only look at transcriptional regulation).
The pathways chosen can be divided into two groups based on their proximity to the galactose pathway in the yeast metabolic network. Table III lists these pathways. The "Close to galactose" category contains those pathways that are nearly adjacent to the galactose pathway, whereas the "Far from galactose" category contains several pathways that are not directly related to galactose. For each pathway, we begin by identifying a set of co-expressed genes based on their pairwise correlation coefficients of the log ratio expression levels over all 20 conditions. A subset of the metabolic proteins were chosen if their correlation coefficient was sufficiently high (a conservative >0.35). We computed a Steiner tree using the weight function w1 for the set of co-expressed genes in each pathway. If the co-expression of these genes is due to the fact that the genes share a common set of regulators, then we expect to find relatively small Steiner trees. If so, this may in turn imply that a small number of related transcription factors are responsible for the expression of these genes.
|
|
The pathways that are far from the galactose pathway serve as a control and, as expected, have Steiner trees that are close to the expected size of a random tree (Table III, R and p value columns). Because the expression experiments perturb the galactose pathway (through the deletion of GAL genes), the genes we identify as co-expressed in these pathways are likely unrelated to galactose metabolism and have no obvious physiological relevance (because this set of genes is essentially chosen randomly, we do not expect to detect co-regulation). This explains why the sizes of the optimal Steiner trees are not significantly different from the size of random trees for these pathways.
![]() |
DISCUSSION |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
Our framework has identified an interesting subnetwork that suggests a new connection between GAL80 and the diauxic shift. The diauxic shift occurs when glucose and other fermentable carbon sources have been exhausted in the growth medium; this was not the case in the experimental conditions used by Ideker et al. (14). Only those cultures with a GAL80 deletion mutant resulted in elevated expression levels of diauxic shift genes. We propose that these cells are prematurely activating or priming the diauxic shift response and that GAL80 contributes to the correct regulation of induction of this response in wild-type cells. It has been observed that GAL80-null cells grow slowly in raffinose (14), and this fact may be a reflection of the competition between two opposing pathways in these cells.
We also show that our framework when applied to graphs containing only protein-DNA interactions is capable of finding plausible transcriptional regulatory subnetworks. In particular, the framework shows that the general transcription factors RAP1 and HSF1 directly regulate a third of the most overexpressed genes in a temperature shift from 29 to 37 °C.
Gene expression experiments often incorrectly identify genes as being differentially expressed due to the technical and biological errors associated with microarrays. The subnetworks approach in this study can complement current statistical approaches to locating these false positives. If genes from the distinguished set are not closely connected to remaining members (and are therefore not physiologically related to the remaining members), such genes are possibly more likely to be simply experimental artifacts and could be considered less interesting. It is estimated that current transcriptional regulatory networks capture only 10% of existing transcriptional regulatory relationships (33). Such false negatives in the interaction network may result in Steiner trees containing longer paths that are not biologically relevant (or at least very difficult to rationalize). False positive edges in the interaction network may also cause our approach to return irrelevant trees. Our implementation explicitly labels edges with the source of the interaction to help investigators navigate such issues. Nevertheless, as the accuracy of the underlying interaction networks improves, the framework will be able to find more plausible regulatory subnetworks.
The framework presented here is a first step. Better weight functions for the nodes in the interaction network remain to be developed to determine which such functions most closely reflect biological reality. Expression measurements alone are unlikely to return meaningful results in many contexts. It is also possible to weight (with likelihoods) our confidence in each interaction and integrate more types of data into the interaction networks. Weightings that incorporate Gene Ontology classifications and algorithms that use this information to make choices as to how to build and augment the backbone may provide a solid improvement. The challenge will be to find accurate ways to compute these more complex problems.
![]() |
ACKNOWLEDGMENTS |
---|
![]() |
FOOTNOTES |
---|
Published, MCP Papers in Press, February 18, 2005, DOI 10.1074/mcp.M400110-MCP200
1 The abbreviations used are: ChIP, chromatin immunoprecipitation; HAP, heme activated protein.
* The material presented in this paper is based upon work supported by the National Science Foundation under a grant awarded in 2002.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
¶ Supported by Canadian Institutes of Health Research Canada Graduate Scholarship.
** To whom correspondence should be addressed. Tel.: 514-398-5928; E-mail: hallett{at}mcb.mcgill.ca
![]() |
REFERENCES |
---|
![]() ![]() ![]() ![]() ![]() ![]() |
---|
|
HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
All ASBMB Journals | Journal of Biological Chemistry |
Journal of Lipid Research | Biochemistry and Molecular Biology Education |