Exploring genetic regulatory networks in metazoan development: methods and models*

Marc S. Halfon and Alan M. Michelson

Division of Genetics, Department of Medicine, Brigham and Women’s Hospital and Howard Hughes Medical Institute, Boston, Massachusetts 02115


    ABSTRACT
 
One of the foremost challenges of 21st century biological research will be to decipher the complex genetic regulatory networks responsible for embryonic development. The recent explosion of whole genome sequence data and of genome-wide transcriptional profiling methods, such as microarrays, coupled with the development of sophisticated computational tools for exploiting and analyzing genomic data, provide a significant starting point for regulatory network analysis. In this article we review some of the main methodological issues surrounding genome annotation, transcriptional profiling, and computational prediction of cis-regulatory elements and discuss how the power of model genetic organisms can be used to experimentally verify and extend the results of genomic research.

genomics; cis-regulation; microarrays; computational biology


    INTRODUCTION
 
THE RECENT AVAILABILITY of whole genome model organism sequences, genome-wide transcriptional profiling methods, and computational tools for exploring these rapidly accumulating new data has set the stage for the study of developmental genetic regulatory networks to come into its own as a vital component of 21st century experimental biology. While studies of gene regulatory networks in yeast (reviewed in Ref. 111) illustrate the power of combining genomic techniques with genetic analysis, delineating the networks that mediate metazoan development, with their more complex transcriptional regulation and multiple interacting intercellular signaling pathways, presents additional challenges. In this article, we will review some of the recent methods that are now available for characterizing developmental regulatory networks with a focus on how model organism genetics can provide a framework for applying and empirically validating the data from these approaches.

Metazoan development occurs by the progressive determination of cells from a pluripotent undifferentiated state through successive states of more and more restricted developmental potential, until the full complement of mature terminally differentiated cells has been specified. While various epigenetic and environmental factors can influence this developmental program, in essence its course is driven by a series of genetic regulatory networks set into motion at the beginning of embryogenesis (see Refs. 20, 25). These networks receive input in the form of intercellular signals and output instructions in the form of the regulated expression of specific genes. The particular combination of signals received by a given cell comprises a form of "signaling code," which is transduced to the nucleus to create a "transcriptional code" composed of both signal-induced and tissue-specific transcription factors (TFs). The linchpins of the regulatory networks are the cis-regulatory elements that directly control gene expression through interpretation of the transcriptional code and thus act as sites of integration for the combinatorial action of multiple signal transduction pathways and tissue-specific selector proteins (Fig. 1).



Fig. 1. Signaling and transcriptional codes in development. Intercellular signals A, B, and C here comprise a "signaling code" received by a cell, initiating signal transduction events that lead to the binding of pathway-specific downstream transcription factors (A', B', C') to a cis-regulatory module. Also binding to the regulatory DNA are tissue-specific (i.e., not general nuclear effectors of the signaling pathways) transcription factors (TFs) D and E. The resulting A'-B'-C'-D-E "transcriptional code" acts on the regulatory module to activate gene transcription. For simplicity, cross talk between the pathways has not been illustrated, but such interactions are a common component of genetic networks (for instance, see Fig. 3C).

 
Although the notion of genetic regulatory networks is not a new one (13), elucidating them has proven a difficult task; surprisingly few have been defined in any detail, with perhaps only the network involved in sea urchin endomesoderm specification described comprehensively (26). The techniques that have worked so well for investigating the functions of individual genes and essentially linear signaling pathways (e.g., forward and reverse genetics, targeted gene knockouts, and gene misexpression studies) turn out to be not nearly as well-suited for exploring complex networks (although these methods continue to play a critical role in testing and validating new hypotheses about network components, as discussed below). A major limitation here is a paucity of knowledge regarding which genes are at play, and what their functions are, in any given process. One clear lesson from the sequencing of the worm, fly, and human genomes is just how many previously unidentified genes await characterization. Genetics has been a powerful tool for identifying individual network members, but extensive pleiotropy and redundancy limit the amount of information that is likely to be gathered by genetic screens alone. Moreover, these methods cannot reveal the detailed molecular information about cis-regulatory elements that is critical to comprehending regulatory networks. Fortunately, the explosion of new genomic technologies is making the study of these networks an ever more tractable problem. Fully sequenced genomes, microarray platforms that allow for genome-wide transcriptional profiling, and the development of computational tools for identifying TF binding sites and cis-regulatory modules are beginning to provide a wealth of data that can be combined with more traditional genetic analyses in forming an integrated understanding of embryonic development.


    Genome Annotation
 
The rapidly growing availability of whole genome and expressed sequence tag (EST) sequences is an invaluable resource for the study of regulatory networks. The number of eukaryotic organisms undergoing sequencing is approaching 200 (see http://wit.integratedgenomics.com/GOLD/eukaryagenomes.html) with the inclusion of a substantial number of metazoans. Essentially complete are the genomic sequences of model organisms such as Caenorhabditis elegans (23) and Drosophila melanogaster (1), and a number of vertebrate genomes have been completed or are near to completion, including human (64, 103), mouse (http://www.ensembl.org/Mus_musculus/), and puffer fish (Fugu rubripes; http://genome.jgi-psf.org/fugu3/fugu3.home.html).

A large part of the utility of genomic sequence data lies in the accuracy of the corresponding gene annotation; correct knowledge of the genes encoded by the genome is critical for constructing tools such as microarrays to assess the full complement of genes expressed at any time or place and to identify the components of each functioning network. Genome annotation requires careful curation to integrate data for genes known from traditional molecular cloning studies and EST sequencing with computational predictions based on homology searching and ab initio gene finding programs. When used in combination, these methods function surprisingly well in terms of identifying genes, although they are significantly weaker in their ability to correctly predict intron/exon structure and to detect 5' noncoding exons and transcription start sites (22, 84). Continued refinement of annotation and gene prediction methods is clearly required. For example, a variety of systematic studies conducted since the initial annotation of the Drosophila genome (1) have identified a large number of unannotated or incorrectly annotated genes (2, 42, 57, 88). Homology-based searching using sensitive methods such as PSI-BLAST and extensive sequencing of tissue-specific EST libraries (which are more likely to contain rare transcripts not widely expressed in other tissues) are two methods that have yielded good results (2, 42), providing a strong argument in favor of continued EST sequencing even for organisms for which genomic sequence is available. The ability to compare sequences of related species should also be of significant help here, as coding regions tend to be highly conserved among sister species (88). 
To this end, it is gratifying that a number of sister species or close relatives to already-sequenced organisms are currently undergoing or slated to undergo sequencing, including Caenorhabditis briggsae, Drosophila pseudoobscura, rat and mouse, and the mosquitoes Anopheles gambiae and Aedes aegypti. Interspecific comparisons will also be of enormous benefit in identifying regulatory regions (see below).

An additional challenge to genome annotation efforts lies in the identification of noncoding RNAs (29). A large variety of noncoding RNAs are known to exist, including the rRNA and tRNA genes, the snRNAs that function in the spliceosome, the Xist and roX RNAs that function in X chromosome inactivation and dosage compensation in mammals and flies, respectively, the small nucleolar RNAs (snoRNAs), and the "micro" RNAs that appear to play an important role in mRNA regulation (30, 87). Noncoding RNAs tend to score poorly in gene-finding programs due to their lack of an open reading frame, and their diverse nature has made it difficult to develop computational methods for their detection. Homology searches have been among the most effective ways of finding new noncoding RNAs, but a lack of strong sequence conservation means that closely related genomes are required for comparison. Moreover, it is difficult to distinguish noncoding RNAs from conserved regulatory regions. Extensive Northern analysis or hybridization to microarrays containing tiled whole genome sequence (56, 79, 92) may be required to distinguish between these possibilities, but these are labor- and resource-intensive methods that will not quickly be applied to large, whole genomes. A full and accurate accounting of all the varied noncoding RNAs, let alone their functions, in eukaryotic genomes may thus be a long time in coming.


    Unraveling the Transcriptome
 
With sufficient genome data in hand, it becomes possible to move on to full-scale analysis of the transcriptome, the population of genes expressed at a particular time or in a specific tissue, or in response to a defined set of experimental conditions. Given the well-established ubiquitous role of transcriptional regulation in biological systems (25), virtually any developmental process is amenable to investigation by gene expression profiling. The major limitation of this approach in any given context is the accessibility of suitable material for analysis and the ability to perturb the system by genetic or other means. The most thorough way of transcriptional profiling may be serial analysis of gene expression (SAGE; 102), wherein the sensitivity of the assay is limited mainly by the extent of DNA sequencing the investigator is able to undertake. In practice, however, this can become an expensive and labor-intensive endeavor, and microarray technology is emerging as the method of choice for transcriptional analysis.

Although microarrays, both cDNA or long oligonucleotide arrays and high-density Affymetrix GeneChip arrays, continue to gain in popularity and are rapidly becoming a mainstream experimental technique, consensus is still lacking as to the best methods to use in evaluating the data obtained from microarray experiments. This is a nontrivial issue, as experiments in which 15,000 or more genes are interrogated simultaneously build up data with great rapidity and demand sophisticated computational and statistical analytical techniques unfamiliar to many traditionally trained biologists. A common metric used for analysis of microarray data has been a simple "fold-change" cutoff, in which an arbitrary ratio (frequently 2:1) of gene expression values is used to define a change in expression level. However, such a cutoff carries little statistical authority; genes with high variability in expression in replicate experiments can exhibit a large apparent fold change but little statistical significance, and conversely, genes showing a small fold change can nevertheless be shown to change with high significance due to their low variance among replicates (Fig. 2).



Fig. 2. The importance of statistical evaluation of microarray data. Illustrated are expression data showing the average relative activation of two genes ("A" and "B") compared with a control condition or reference sample from 3 replications of a microarray experiment (Choe S and Halfon MS, unpublished data). Using a simple fold change cutoff of 2-fold, gene B would be called upregulated due to its apparent 2.4-fold activation, whereas gene A would be considered unchanged due to its 1.7-fold increase. However, gene B displays substantial variation and might not satisfy statistically valid criteria for being considered activated. Gene A, on the other hand, is consistently upregulated by over 1.5-fold and may thus represent a true activation despite the smaller magnitude of its change.
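The contrast drawn in Fig. 2 can be sketched numerically. The replicate ratios below are hypothetical values chosen only so that their averages match the 1.7-fold and 2.4-fold figures quoted in the legend; they are not the unpublished data behind the figure. The sketch applies a one-sample t-statistic to log2 ratios, which is one simple choice among the many tests discussed in this section, and compares it with the two-sided 5% critical value for 2 degrees of freedom (4.303).

```python
import math
import statistics

# Hypothetical replicate expression ratios (experiment / control), 3 replicates each.
# Chosen so that the mean fold changes are 1.7 (gene A) and 2.4 (gene B).
replicates = {
    "A": [1.6, 1.7, 1.8],   # small but consistent change
    "B": [0.9, 2.0, 4.3],   # large but highly variable change
}

T_CRITICAL_DF2 = 4.303  # two-sided 5% critical value of Student's t, 2 degrees of freedom


def one_sample_t(values):
    """t-statistic for the null hypothesis that the mean of `values` is zero."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))


results = {}
for gene, ratios in replicates.items():
    log_ratios = [math.log2(r) for r in ratios]
    results[gene] = {
        "mean_fold": statistics.mean(ratios),
        "passes_fold_cutoff": statistics.mean(ratios) >= 2.0,
        "t": one_sample_t(log_ratios),
    }
    results[gene]["passes_t_test"] = abs(results[gene]["t"]) > T_CRITICAL_DF2

for gene, r in results.items():
    print(gene, r)
```

With these illustrative numbers the two criteria disagree in both directions: gene B passes the fold-change cutoff but not the t-test, and gene A does the reverse.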

 
Fortunately, replication and proper statistical analysis are becoming widely recognized as critical components of microarray analysis. A number of laboratories have made a compelling case for using well-proven analysis of variance (ANOVA) models both to normalize arrays between experiments and to evaluate the significance of microarray results (28, 58-60, 109, 112). Used in conjunction with thoughtful attention to experimental design, these methods can reduce the number of microarrays required for an experiment while simultaneously improving the statistical power of the analysis (21, 59). For example, the popular reference sample design for cDNA arrays, in which each array is simultaneously hybridized with an experimental RNA and a common "reference" RNA, is not the most efficient method, as one-half of the data is devoted to the reference rather than to meaningful experimental samples. Kerr and Churchill (59) propose a "loop" design in which samples are compared directly with one another in a cycle, i.e., A->B, B->C, C->A, allowing more data to be gathered about each experimental condition. In this example, for instance, one obtains twice as much data as the reference sample method would allow, without requiring more arrays to be used. However, loop designs can become unwieldy when large numbers of experimental conditions are to be compared, and they also suffer from being "closed"; that is, with such a design, it is difficult to come back later to add one new sample for comparison. Devising other experimental designs that overcome some of the limitations of both the reference-sample and loop designs should thus remain an area of priority in the microarray field.
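The "twice as much data" comparison can be checked by simple counting. The following is a minimal sketch that assumes two-channel arrays and three experimental conditions; the sample labels and the helper name `measurements_per_sample` are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter


def measurements_per_sample(arrays):
    """Count how many times each sample is hybridized across a set of two-channel arrays."""
    counts = Counter()
    for channel_1, channel_2 in arrays:
        counts[channel_1] += 1
        counts[channel_2] += 1
    return counts


# Three arrays in each design; every array carries two labeled samples.
reference_design = [("A", "ref"), ("B", "ref"), ("C", "ref")]
loop_design = [("A", "B"), ("B", "C"), ("C", "A")]

reference_counts = measurements_per_sample(reference_design)
loop_counts = measurements_per_sample(loop_design)

print("reference design:", dict(reference_counts))
print("loop design:     ", dict(loop_counts))
```

With the same three arrays, each experimental condition is measured once in the reference design and twice in the loop design, while the reference design spends half of its six channels on the reference sample.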

Another important contribution has come from Tusher et al. (100), whose SAM ("significance analysis of microarrays") program introduces a practical solution to the multiple hypothesis testing problem (i.e., the special statistical considerations that come into play when testing numerous hypotheses simultaneously, as when looking at expression changes in thousands of genes) with their implementation of a false discovery rate (FDR) controlling procedure. Traditional approaches to multiple hypothesis testing, such as the Bonferroni correction (94), address the probability of having any false-positive results (type I errors). However, these corrections are often so stringent as to allow few if any genes to pass a significance score cutoff (e.g., for a P < 0.05 level on a 15,000 gene array, a classic Bonferroni correction would require a P value of 0.05/15,000 = 3.3 × 10⁻⁶). The FDR procedure, on the other hand, bounds the expected proportion of false-positive results among the genes called significant, rather than ruling false positives out completely. This is a fair tradeoff in many microarray experiments, where a small number of false-positive results might be tolerable among the entire data set. For Affymetrix GeneChip arrays, the Affymetrix Microarray Suite version 5 incorporates measures of statistical significance, a substantial improvement over earlier, empirically based versions of the software. An attractive alternative to the Affymetrix software is the independently developed dChip software of Li and Wong (67), which uses statistical models to perform normalization and a variety of downstream analyses of GeneChip data. Numerous other approaches have been suggested for analysis of both cDNA and Affymetrix arrays based on t-statistics, nonparametric (e.g., Wilcoxon) tests, regression analysis, and Bayesian methods (inter alia, 31, 55, 77, 99).
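The difference in stringency between the two kinds of correction can be illustrated with a short sketch. Note the assumptions: SAM itself estimates the FDR by permutation, which is not reproduced here; the code below instead uses the Benjamini-Hochberg step-up procedure, a different and widely used FDR-controlling method, and the ten P values are invented for illustration.

```python
def bonferroni(pvalues, alpha=0.05):
    """Flag tests whose P value passes the Bonferroni-corrected cutoff alpha / m."""
    cutoff = alpha / len(pvalues)
    return [p <= cutoff for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Flag tests passing the Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= k * alpha / m.
    largest_k = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank * alpha / m:
            largest_k = rank
    passed = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= largest_k:
            passed[index] = True
    return passed


# Hypothetical P values for ten genes.
pvalues = [0.001, 0.008, 0.012, 0.03, 0.2, 0.5, 0.7, 0.9, 0.95, 0.99]

n_bonferroni = sum(bonferroni(pvalues))
n_bh = sum(benjamini_hochberg(pvalues))
print("Bonferroni calls:", n_bonferroni)
print("Benjamini-Hochberg calls:", n_bh)

# The cutoff quoted in the text for a 15,000-gene array.
print("Bonferroni cutoff for 15,000 genes:", 0.05 / 15000)
```

On this toy input the Bonferroni correction retains only the smallest P value, whereas the FDR procedure retains the three smallest, accepting that a fraction of the calls may be false.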

This growing multitude of proposed analysis methods leaves the investigator with the dilemma of choosing from among many possible approaches. Unfortunately, little critical assessment of the various methods has yet been performed. Although comparisons have been made among the different strategies (77), these have typically been done using existing data sets in which the "correct" patterns of expression are not known. Thus differences in how the methods perform can be seen, but it is not known which results most accurately reflect true changes in gene expression. Analysis of a microarray experiment in which a large number of RNAs of known relative concentration are used could provide invaluable insight into microarray sensitivity and the relative effectiveness of the different existing methods for statistical analysis. A step in this direction has been taken by Irizarry et al. (51), who have compared a number of methods for analysis of Affymetrix data based on a set of spiked-in control RNAs. More extensive testing along these lines using a larger set of controlled RNAs will be a welcome contribution to the microarray field, both for evaluating existing algorithms and for testing new ones, as will continued community-wide assessment efforts such as those sponsored by the CAMDA competition (Critical Assessment Of Microarray Data Analysis; http://www.camda.duke.edu/).

Although statistical methods are necessary for reliably assessing what genes are differentially regulated in microarray experiments, useful overviews of the data can be obtained from clustering algorithms (24, 83, 91). Clustering is an important tool for developing hypotheses based on perceived patterns of gene expression and is particularly well-suited for issues of class discrimination, such as differentiating tumor types from transcriptional profiling of biopsy samples (81). Clustering can also provide hints as to what genes may be acting in common pathways, based on the assumption that such genes will have well-correlated expression. Commonly used clustering algorithms include both hierarchical methods, in which clusters are linked together in successively larger clusters, and nonhierarchical methods, such as k-means clustering and self-organizing maps, which partition the data into a discrete number of groups. Vector algebra approaches, such as principal component analysis, also provide a flexible and useful way of visualizing and sorting out microarray data and are particularly good at reducing the complexity of the data (63, 83). The latter is a major goal of all clustering methods, as microarray experiments contain a great deal of noise in the form of "background" gene expression unrelated to the particular condition being assessed, which can often obscure subtle but important actual changes in expression. Filtering genes prior to clustering using statistical criteria, as discussed above, also helps to ensure that only the most relevant genes are used in the analysis.
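As a concrete illustration of the hierarchical approach, the following minimal sketch performs average-linkage agglomerative clustering using one minus the Pearson correlation as the distance between expression profiles. The gene names and expression values are hypothetical, the choice of linkage and distance is only one of many possibilities, and profiles with zero variance are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        # Average of (1 - correlation) over all cross-cluster gene pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        index_pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(index_pairs, key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical expression profiles across four time points.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}

clusters = average_linkage(profiles, n_clusters=2)
print(clusters)
```

Because correlation ignores absolute expression level, g1 and g2 group together despite their different magnitudes, as do g3 and g4; a Euclidean distance would give a different view of the same data.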

It is important to note that there is no "right" clustering approach to use for microarray analysis. Clustering outcomes will vary considerably depending on the algorithm used and the manner in which the data are processed prior to clustering, and it is often worthwhile to cluster the data in a number of different ways to see which method provides the most useful view for the questions being addressed. For instance, one clustering algorithm might provide a good grouping of genes based on high vs. low levels of expression, whereas a different clustering of the same data might identify periodic fluctuations in expression such as those that are associated with genes subject to cell cycle or circadian regulation. Genes involved in regulating numerous cellular processes provide a particular challenge to cluster analysis as they could properly fit into many biologically relevant clusters. Ultimately, it lies with the investigator to decide what views of the data provide the most useful biological picture and generate the most reasonable hypotheses for further testing.

Transcriptional profiling experiments in metazoans bring with them additional challenges not encountered with single-celled organisms such as yeast and bacteria. Large among these is the heterogeneity of cell and tissue types within the animal, which makes it difficult to isolate pure cell populations for profiling. This is particularly disadvantageous when the tissue of interest might constitute only a small fraction of the experimental sample; in this case, a gene that changes dramatically in that tissue but is expressed at a high but constant level in the contaminating cells will be difficult to detect. One solution is to mark individual cell populations using a fluorescent protein reporter construct and to isolate the marked cells by flow cytometry (14); another is to isolate cells by laser-capture microdissection (32). However, the difficulty of obtaining large amounts of tissue by these methods carries its own complications, as an RNA amplification method must then be used to generate enough RNA for microarray experiments. Both PCR-based and linear amplification methods can be used for this purpose (7, 69, 101), but it is still unclear to what extent such manipulations might skew the RNA population and affect the accuracy of the transcriptional profiling results.


    Motifs and Modules
 
Transcriptional profiling can reveal the genes that are expressed in a tissue, but for an understanding of the network architecture underlying this expression we must also turn to a characterization of the cis-regulatory mechanisms that modulate it. Traditionally, analysis of cis-regulatory elements (i.e., enhancers and promoters) has been a laborious, gene-by-gene, empirically based approach that entailed the isolation and testing of the 5' and 3' flanking DNA and intronic sequences of the gene of interest in reporter gene assays. The growth of genomic sequence data, however, promises to greatly accelerate the task of regulatory module identification and characterization.

A key step in defining cis-regulatory elements is the identification of individual TF binding sites (motifs). A substantial number of binding site motifs are known from years of study of individual TFs and enhancers, and many of these have been collected into searchable databases such as TRANSFAC (108). When TFs are known, in vitro binding site specificities can be found by methods such as SELEX (12); the development of rapid, high-throughput methods such as protein binding to DNA microarrays (15, 16) promises to greatly accelerate the pace of such determination. In vitro binding affinities do not always accurately reflect where TFs are bound in vivo, however, as clearly illustrated by promising new methods in which chromatin immunoprecipitation (ChIP) is combined with DNA microarray analysis to generate a genomic profile of TF binding (11, 52, 68, 85). Taking full advantage of ChIP methods will require the generation of microarrays with whole genome noncoding sequence. This is no small task with respect to the higher eukaryotes, although smaller-scale efforts based on sequencing of immunoprecipitated DNAs or using microarrays targeted at specific sequences of particular interest are immediately feasible. As even in vivo binding does not always correlate with regulatory function (11, 52, 68, 85), pairing ChIP with transcriptional profiling should prove to be a powerful combination of analyses (9).

Computational approaches, although unable to assign specific TFs to their cognate binding sites, can be of tremendous aid in identifying relevant binding site motifs (76, 80). The most successful of these methods to date appear to be those that rely on gene expression profiles. Genes are clustered based on common expression (see discussion above), and their flanking genomic regions are aligned and analyzed for the presence of common sequence motifs using any of a number of algorithms developed for this purpose. Among the more commonly used of these programs are CONSENSUS (47), MEME (4), AlignACE (48, 86), and the Gibbs sampler (65) (for a review of the basic theory behind these approaches, see Ref. 96). In yeast, where regulatory elements tend to be promoter-proximal, these methods have worked well (86, 98). Extension to the higher eukaryotes, where regulatory regions are more extensive and can lie many kilobases in either direction from the promoter or within introns (3), introduces significant complexity into such approaches, and false-positive rates can be high (see below). Identifying motifs that occur in pairs should help in terms of both specificity and sensitivity (43). However, not all motifs are consistently paired with others, and developing appropriate measures to evaluate the significance of motifs discovered through multiple alignment remains an important priority.

An attractive variation on the multiple alignment theme inverts the process by beginning with identified motifs and then determining whether they are associated with coexpressed genes. The motifs can be drawn from known TF binding sites or from computational selection based on statistically overrepresented "words" in genomic sequence (17, 54, 82). Motifs that are associated with the same expression profiles are good candidates for motifs that act together to jointly regulate gene expression. Pilpel et al. (82) introduce a series of tools for constructing "combinograms," which aid in the visualization of relationships between motifs and expression profiles. Again, however, challenges lie in the fact that regulatory elements can be distant and to either side of a gene, making it difficult to be certain which motifs match individual expression profiles.
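The idea of selecting statistically overrepresented "words" can be sketched as a simple enrichment ratio between the flanking sequences of a coexpressed gene cluster and a background set. This is a toy illustration only: the sequences are invented, the smoothing scheme is an assumption, and the published methods cited above use more rigorous statistics than a raw frequency ratio.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping word of length k in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enrichment(cluster_seqs, background_seqs, k, pseudocount=1.0):
    """Ratio of word frequency in the cluster to its smoothed frequency in the background."""
    foreground = kmer_counts(cluster_seqs, k)
    background = kmer_counts(background_seqs, k)
    fg_total = sum(foreground.values())
    # Laplace-style smoothing over all 4**k possible words avoids division by zero.
    bg_total = sum(background.values()) + pseudocount * 4 ** k
    scores = {}
    for word, n in foreground.items():
        fg_freq = n / fg_total
        bg_freq = (background[word] + pseudocount) / bg_total
        scores[word] = fg_freq / bg_freq
    return scores


# Hypothetical upstream sequences of coexpressed genes and of unrelated genes.
cluster_seqs = ["TTCACGTGAT", "GGCACGTGCC", "ATCACGTGTA"]
background_seqs = ["ATATATGCGC", "TTGGCCAATT", "GCGCATATAT"]

scores = enrichment(cluster_seqs, background_seqs, k=6)
top_word = max(scores, key=scores.get)
print(top_word, round(scores[top_word], 1))
```

In this toy input the word shared by all three cluster sequences receives the highest score; on real genomic sequence, the background model and the treatment of both DNA strands would need considerably more care.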

Although identifying TF binding sites is a necessary step, the true key to cis-regulation lies in the modular nature of cis-regulatory elements, each containing a number of binding sites for an assortment of TFs and directing gene expression in a particular spatiotemporal pattern. Identifying not just individual binding sites but rather entire regulatory modules is thus a critical component of defining regulatory networks. Again, a variety of computational techniques have been developed to aid in this task (33, 76, 80).

One way to bypass the difficulties inherent in the fact that metazoan cis-regulatory modules can be far from the genes they regulate is to consider only promoter sequences. Since these sequences lie close to the transcription start site, the sequence space that must be analyzed to find regulatory modules and their associated binding site motifs is vastly reduced, provided, of course, that the transcriptional start can be accurately located. The presence in many promoters of the TATA box provides some help in this regard, although the low complexity of this site means that false-positive rates will be high when it is used as the main criterion. In mammals, the presence of CpG islands near many promoter sequences allows for better discrimination. A recent algorithm by Davuluri et al. (27), which incorporates information on CpG islands, RNA splice-donor sites, and models built from known promoter and first exon sequences, displays a strong ability to recognize transcription start sites associated with CpG islands, but a much lower ability to discern those that are not CpG related. A database of transcriptional start sites for human genes has been created (97) and can thus provide a rich source of raw data for undertaking searches for promoter-proximal regulatory modules.
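The CpG island signal mentioned above can be made concrete with a short sketch. The thresholds used here (length of at least 200 bp, G+C content of at least 50%, and an observed/expected CpG ratio of at least 0.6) are commonly used criteria that are assumed for illustration; they do not come from this article, and the code does not reproduce the algorithm of Davuluri et al. The test sequences are artificial.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # CpG dinucleotides cannot overlap one another
    gc_fraction = (n_c + n_g) / n
    # Expected CpG count under independence is (n_c * n_g) / n.
    obs_exp = (n_cpg * n) / (n_c * n_g) if n_c and n_g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, G+C, and observed/expected thresholds."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_length and gc_fraction >= min_gc and obs_exp >= min_obs_exp


# Artificial sequences: one saturated with CpG, one with the same G+C content but no CpG.
cpg_rich = "CG" * 100
cpg_depleted = "CAGT" * 50

print(cpg_stats(cpg_rich), looks_like_cpg_island(cpg_rich))
print(cpg_stats(cpg_depleted), looks_like_cpg_island(cpg_depleted))
```

In practice such a test would be applied in sliding windows along genomic sequence rather than to a whole sequence at once.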

What of more distant regulatory elements? One powerful method for locating these lies in comparative genomics, wherein homologous regions from related organisms are aligned and then searched for conserved blocks of sequence (also referred to as "phylogenetic footprinting"). This approach has proven valuable not only on a gene-by-gene scale, but also on a genomic basis (46), and several useful tools have been developed to produce graphical views of aligned genomic sequences complete with gene and, in some cases, TF binding site, annotation (53, 70, 72, 89, 113). Nevertheless, phylogenetic footprinting alone is usually insufficient for defining the regulatory elements involved in a particular developmental network as it is not possible to tell, merely from alignment data, the functional role of an identified conserved sequence block. Again, as conserved elements may be found in intergenic regions far from any coding sequence, it is not always clear what gene is subject to regulation by the element. Moreover, the amount of divergence between the two sequences being compared has a significant impact on how useful the comparison will be (46); in some cases when comparing mouse and human sequences, for example, the conservation is so extensive that individual regulatory elements cannot be distinguished. In these instances, comparison with a more diverged vertebrate sequence will be of higher value. Thus, despite the potential power of comparative sequence-based approaches, they will have limited utility for conducting genome-wide searches for regulatory sites until sufficient whole genome sequencing of related organisms has been conducted.
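The core of a phylogenetic footprinting scan can be sketched as a sliding-window identity calculation over a pairwise alignment. The sketch assumes that the two sequences have already been aligned (gaps written as "-"); the window size, identity threshold, and sequences are hypothetical, and the graphical tools cited above do considerably more than this.

```python
def window_identity(aligned_a, aligned_b, window, step=1):
    """Return (start, fraction identical) for each window of an aligned sequence pair."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        # Gap columns never count as matches.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, matches / window))
    return results


def conserved_windows(aligned_a, aligned_b, window, min_identity):
    """Start positions of windows whose identity meets the threshold."""
    return [start for start, identity in window_identity(aligned_a, aligned_b, window)
            if identity >= min_identity]


# Hypothetical alignment: divergent flanks around a perfectly conserved 8-bp block.
species_1 = "ACGTACGT" + "AAGGCCTT" + "ACGT"
species_2 = "CCCCCCCC" + "AAGGCCTT" + "TTTT"

conserved = conserved_windows(species_1, species_2, window=8, min_identity=1.0)
print(conserved)
```

The effect of divergence noted in the text shows up directly in such a scan: if the two species are too similar, nearly every window passes the threshold and the conserved block no longer stands out.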

A number of model-based approaches to cis-regulatory element prediction have been developed that rely on the prior characterization of several regulatory modules of a similar functional class; models generated from known vertebrate muscle gene enhancers and retroviral long terminal repeats (LTRs) have been effectively applied (36, 107). These methods use a number of similar well-characterized enhancers to formulate a model that can serve as the basis for a genomic search for other related elements. The models can be highly specific, requiring a number of motifs in a particular order with specific spacing (36, 40, 61), or more general, seeking only a common occurrence of a number of specified motifs (62, 107). One of the most promising methods described so far is the logistic regression analysis used by Wasserman and colleagues (62, 107) to search for elements regulating muscle and liver gene expression, which has been combined with a phylogenetic footprinting approach to better distinguish true regulatory modules from false-positive results. A drawback to most of the model-based approaches is that they require knowledge of a number of regulatory elements of a similar type, which is available for only a limited number of gene expression patterns. In addition, the false-positive rate for all of these methods appears to be high, exceeding 50%.

A third type of approach has relied on the identification of dense clusters of known TF binding sites as the basis for computational prediction of regulatory elements (10, 37, 71, 78). Clustering of binding sites is a known characteristic of cis-regulatory modules (25), and this approach has led to the identification of at least two previously unidentified enhancers (10, 71). Here again, the false-positive rates are high. Furthermore, as with the model-based approaches, a substantial number of known elements appear to be required to select an appropriate site-density cutoff (10) or to train a probabilistic model (37). Methods based on a single TF (71) may not be widely applicable, as not all important TFs are represented in large numbers in regulatory elements (for example, two enhancers we have characterized in detail contain only a single site for the Wnt/Wg-responsive TF Tcf) (44, 45). Also, these methods are highly sensitive to the length and extent of degeneracy of the motif consensus sequence (i.e., a motif’s information content). Shorter and/or more degenerate sites, for example, the E-box (CANNTG) sequence to which bHLH-family TFs bind (75), will occur by chance more frequently, and therefore cluster more frequently, than sites with high information content, leading to a higher false-positive rate. These considerations must become an important component of any predictive strategy that relies on site number or density. Use of a combinatorial strategy, in which several TFs are considered together, should help to increase the selectivity of the approach, especially when the TFs are chosen to reflect a particular transcriptional code known to direct expression in a specific temporal-spatial pattern (45). The use of multiple concurrent strategies, such as motif number, interspecific sequence conservation, and information based on existing model enhancers, should also be of significant aid in reducing false-positive rates.
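The effect of motif length and degeneracy on chance occurrence can be worked through with a short Python sketch. It assumes a uniform base composition (each base with probability 1/4) and independent positions, which real genomes do not satisfy, so the numbers are rough guides only; the function names are illustrative.

```python
import math

# IUPAC nucleotide codes mapped to the bases they allow
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_probability(consensus):
    """Probability that a random position matches the consensus (uniform bases)."""
    p = 1.0
    for symbol in consensus.upper():
        p *= len(IUPAC[symbol]) / 4.0
    return p


def information_bits(consensus):
    """Information content in bits: 2 bits for a fixed base, 0 for N."""
    return sum(2.0 - math.log2(len(IUPAC[s])) for s in consensus.upper())


def expected_chance_sites(consensus, length):
    """Expected number of chance matches on one strand of a random sequence."""
    positions = max(length - len(consensus) + 1, 0)
    return positions * match_probability(consensus)


print(match_probability("CANNTG"))            # 0.00390625, i.e. 1 in 256
print(information_bits("CANNTG"))             # 8.0
print(expected_chance_sites("CANNTG", 1000))  # 3.88671875
```

Under these assumptions the E-box is expected roughly four times per kilobase by chance alone, whereas a fully specified 10-bp site has a per-position match probability of (1/4)^10, about one in a million. Any site-density cutoff therefore needs to be set relative to the information content of the motifs being counted.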

The importance of empirical validation of predicted elements cannot be overstated. The most commonly used criterion in assessing the success of regulatory element prediction algorithms has been to determine whether the predicted element maps near a gene with the expected expression pattern; those that map in proximity to such a gene are considered true-positive results, while those that do not are deemed false positives. However, in vivo testing of predictions that we have made demonstrates that this criterion can be misleading: although predicted elements can lie close to genes with the expected expression pattern, they do not always turn out to be bona fide regulatory elements when tested using reporter gene assays (45). Unfortunately, actual testing of predicted elements has not played a large role in the studies so far reported, and false-positive rates may be considerably higher than generally appreciated.


    The Power of Genetics
 
Fundamentally, understanding developmental regulatory systems is a job for hypothesis-driven experimental biology, a fact sometimes overlooked in the trend toward high-throughput genomics and computational approaches. Despite the power of these techniques and their ability to vastly accelerate the pace of discovery, to understand the workings of a regulatory network we must know more than just which genes are expressed in a tissue and the composition of their cis-regulatory elements. We must also know the individual inputs into the system at each level, in terms of signaling pathways and transcriptional effectors, and how perturbations at each level of the system affect the system as a whole. In the system that has been most thoroughly explored to date, sea urchin endomesoderm specification, a combination of experimental techniques was used to dissect each component of the system, including gene disruption via antisense RNA injection, use of dominant-negative proteins, overexpression of signaling molecules, site-directed mutagenesis of specific TF binding sites, and cDNA macroarrays (26). This testing has revealed important and in some cases surprising results, such as the considerable role played by repression of gene expression and the existence of extensive negative feedback loops, for example in the repressive "subnetworks" that prevent endomesodermal genes both from being downregulated in the endomesoderm and from being induced outside of the endomesoderm (26). In turning to genetic model systems for analysis, an expanded range of experimental tools becomes available, including whole genome sequence, whole genome microarrays, and both gain- and loss-of-function genetic analysis. In yeast, a combination of genetic, genomic, and proteomic methods has been used to define the galactose utilization pathway (49). In Drosophila, fluorescent embryo sorting has been used to accumulate homozygous mutant embryos for transcriptional profiling using microarrays, and individual cell populations can be isolated as well (14, 38, 39). Recent work using adult Drosophila provides an elegant demonstration of how a combination of genetics, cell culture, and double-stranded RNA interference (RNAi) can be used together with whole genome microarray analysis to address specific hypotheses regarding the various signal transduction pathways that comprise the innate immune response (M. Boutros, H. Agaisse, and N. Perrimon, unpublished observations).


    Computer Simulation and Mathematical Modeling
 
As our identification of components of genetic regulatory networks grows, keeping track of the many different regulatory interactions within the network and testing network models will become increasingly complicated. As network models are built up through the experimental techniques outlined above, therefore, computer simulation will play a useful and perhaps necessary role in organizing and testing the models and in highlighting missing interactions and erroneously assembled network connections. The simulation data will serve as a guide to the experimental biologist in correcting flawed hypotheses and in suggesting new hypotheses and directions of investigation. Already, modeling and simulation have proven useful in understanding such diverse situations as segmentation in Drosophila (105), lateral inhibition during neurogenesis (73), receptor tyrosine kinase signaling (93), and sea urchin endomesoderm development (26). In many instances, these approaches are limited by the difficulties inherent in obtaining the types of accurate quantitative data describing the particular biological system that are necessary for formulating a specific mathematical model. Nevertheless, computer modeling and simulation are clearly areas ripe for further development, and as computer scientists and mathematicians begin to take an increasing interest in analyzing biological systems, these techniques are likely to develop into an important research tool for regulatory network analysis.
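As a minimal illustration of what simulating a network model can look like, the Python sketch below runs a synchronous Boolean network to a steady state. The four-gene wiring is a hypothetical toy example and does not represent any of the networks cited here; real models are often quantitative and, as noted above, depend on measurements that can be hard to obtain.

```python
def step(state, rules):
    """Compute the next state: every gene is updated from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def simulate(initial, rules, max_steps=50):
    """Iterate the network until the state stops changing or max_steps is reached."""
    trajectory = [dict(initial)]
    for _ in range(max_steps):
        nxt = step(trajectory[-1], rules)
        trajectory.append(nxt)
        if nxt == trajectory[-2]:
            break
    return trajectory


# Hypothetical toy wiring: a signal turns on an activator; the activator turns
# on both a target gene and a repressor of that target.
rules = {
    "signal": lambda s: s["signal"],
    "activator": lambda s: s["signal"],
    "repressor": lambda s: s["activator"],
    "target": lambda s: s["activator"] and not s["repressor"],
}
initial = {"signal": True, "activator": False, "repressor": False, "target": False}

for t, state in enumerate(simulate(initial, rules)):
    print(t, state)
```

In this toy network the target is expressed for a single time step and is then shut off by the delayed repressor. A simulated prediction of this kind, a transient pulse rather than sustained expression, can then be compared with experimental observations to support or reject the proposed wiring.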


    Deconstructing a Network: an Example From Drosophila
 
As an example of how the many lines of investigation discussed in this review can be used to gradually build up a comprehensive picture of a regulatory network, we focus in this section on the specification of a set of cardiac and somatic muscle progenitor cells in the Drosophila dorsal mesoderm. These cells, which can be distinguished by their expression of the TF Even skipped (Eve), provide an excellent entry point into a large-scale developmental regulatory network analysis (8, 34, 41). The long history of developmental genetics in Drosophila provides a context for this analysis: we can approach it already possessing a detailed understanding of the initial events leading to specification of the dorsal mesoderm, viz., establishment of dorsal fates stemming from the maternal gradient of the morphogen Dorsal, and segmental patterning of the mesoderm deriving from the early activities of the gap, pair-rule, and segment polarity genes (Fig. 3A) (66). Directly resulting from these early patterning events are orthogonally intersecting domains, in each embryonic hemisegment, of the activities of the Wingless (Wg) and Decapentaplegic (Dpp) signaling pathways (Fig. 3B). These signals provide competence for the specification of the Eve-expressing progenitors by inducing the expression of the mesodermal selector proteins Twist (Twi) and Tinman (Tin). They also induce localized activity of the Ras/MAPK pathway, which acts through the TF PointedP2 (Pnt), by virtue of their regulation of receptor tyrosine kinase signaling pathways (6, 19, 35, 44, 95, 110). Furthermore, they act directly to induce eve expression via their downstream TFs, "mothers against Dpp" (Mad) and dTcf (Pangolin) (44). The key to this network is the integration of these events at the level of cis-regulation of the eve gene itself, through a 300-bp regulatory module, the "muscle and heart enhancer" (MHE) (Fig. 3, C and D) (44).



 
Fig. 3. Defining regulatory networks in the Drosophila dorsal mesoderm. A: a long history of Drosophila developmental genetics has described the initial pattern-formation events that lead to specification of the mesoderm (indicated by Twist expression, arrow in left embryo) and segmentation (Even skipped expression, blue stripes in left embryo; Wingless expression, brown stripes in right embryo) and thus provides a context in which to place an analysis of the immediate events surrounding induction of cardiac and somatic muscle progenitors (Eve expression, arrowheads in right embryo). B: detailed genetic analysis coupled with gene expression studies using immunohistochemistry and whole mount in situ hybridization to RNA has revealed the roles of the orthogonally intersecting domains of Wg and Dpp expression and localized activation of the Ras/MAPK pathway as providing a signaling code for specification of the Eve-positive progenitors (19). C and D: additional genetic studies along with reporter gene analysis in transgenic embryos provided a more detailed picture of the actions of this signaling code, including the presence of positive feedback loops, lateral inhibition, and cross talk between the signaling pathways, as well as an understanding of how the signaling code leads to a transcriptional code consisting of the TFs Mad, dTcf, Tin, Twi, and Pnt; these TFs are integrated at the Eve "muscle and heart enhancer" (MHE) cis-regulatory module to activate expression of eve in the progenitors (18, 44). Direct regulation is indicated with solid black arrows; dashed gray arrows represent regulation that is not yet known to be direct or indirect. E: a computation-based genome-wide search for additional cis-regulatory modules containing binding sites for the five TFs that act at the MHE led to the discovery that the heartbroken (hbr) gene is regulated in a manner similar to eve (45); additional putative elements for other genes are still under investigation. F: microarray experiments, currently in progress, are being used as another method of discovering target genes downstream of the Wg-Dpp-Ras signaling code. G: clustering and motif-extraction algorithms can then be used to discern possible TF binding sites associated with the genes found in the microarray experiments. The AlignACE motif-finding program (48, 86) has been used to discover common motifs in the elements identified in the computational search; at least one of these has been experimentally determined to represent a repressor binding site (45). Yeast one-hybrid analysis and similar methods can now be used to determine the TFs that bind to motifs discovered in this way or to sequences in the regulatory elements that show evolutionary conservation (not pictured). In this way, genetic, molecular, and computational techniques are all being used together to uncover the details of the genetic regulatory networks underlying dorsal mesoderm development.

 
What of additional members of this network? A variety of genomic approaches are now helping to address this question. Microarray analysis is beginning to define additional genes that respond to mesodermal activation of the Wg, Dpp, and Ras/MAPK signaling pathways (Fig. 3F; Halfon MS, unpublished results). A computational search for regulatory modules sharing characteristics of the MHE has identified the mesodermal regulatory sequences for the heartbroken (hbr) gene (50, 74, 104), a member of the FGF-receptor signaling cascade known from genetic studies to be involved in specification of the Eve mesodermal progenitors (Fig. 3E) (45). Like the MHE, the hbr regulatory module appears to integrate signaling from the three described pathways through the binding of the TFs dTcf, Mad, and Pnt, as well as from the mesodermal selector proteins Twi and Tin. Furthermore, it drives reporter gene expression in a pattern overlapping that driven by the MHE, and its sequence is conserved in an evolutionarily distant species. How is it that a member of a receptor tyrosine kinase signaling cascade, which acts via the Ras/MAPK pathway, has regulatory sequences that show it is regulated by this same pathway? This finding is consistent with genetic (18) and microarray (Halfon MS, unpublished results) data showing positive feedback in the FGF-receptor pathway, providing a clear example of how each of the many threads of the network analysis (genetic, computational, and microarray) can cross-validate and reinforce the results from the others.
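The general shape of a combinatorial search, looking for sequence windows that contain at least one site for every TF in a transcriptional code, can be sketched in Python as follows. This is an illustration under stated assumptions and not the procedure used in Ref. 45: the motif strings in the example are arbitrary placeholders rather than the real consensus sequences for dTcf, Mad, Pnt, Twi, or Tin, and the window size is a free parameter.

```python
import re

# IUPAC codes as regular-expression character classes
IUPAC_RE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(consensus):
    """Reverse complement of an IUPAC consensus string."""
    return consensus.upper().translate(COMPLEMENT)[::-1]


def site_starts(sequence, consensus):
    """Start positions of matches on either strand (overlapping matches allowed)."""
    starts = set()
    for motif in {consensus.upper(), revcomp(consensus)}:
        # a lookahead lets finditer report overlapping matches
        pattern = re.compile("(?=" + "".join(IUPAC_RE[c] for c in motif) + ")")
        starts.update(m.start() for m in pattern.finditer(sequence.upper()))
    return sorted(starts)


def windows_with_all_sites(sequence, motifs, window=500):
    """Window start positions containing at least one complete site per motif."""
    hits = {name: site_starts(sequence, cons) for name, cons in motifs.items()}
    found = []
    for start in range(max(len(sequence) - window + 1, 1)):
        end = start + window
        if all(any(start <= p and p + len(motifs[name]) <= end for p in hits[name])
               for name in motifs):
            found.append(start)
    return found


# Placeholder motifs for illustration only (not real TF consensus sequences,
# apart from the generic E-box pattern mentioned earlier in the text).
motifs = {"ebox": "CANNTG", "placeholder": "AAACCC"}
print(windows_with_all_sites("GGCAGGTGTTTTAAACCCGG", motifs, window=16))
```

Requiring every motif in the code to be present is what gives this kind of search more selectivity than counting sites for a single TF, although, as emphasized above, candidate windows still need to be tested in vivo.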

Additional analysis of the computational search results using the AlignACE program (48, 86) has revealed a number of previously unidentified TF binding sites in the MHE (Fig. 3G), one of which has been experimentally determined to be involved in transcriptional repression (45). Yeast one-hybrid assays and candidate gene approaches guided by genetic data can be used in an effort to identify the factors that bind these sites. These results, too, can be checked with respect to the microarray analysis and by in situ hybridization of mRNA to whole embryos, which will confirm the spatial expression patterns of the genes. Genetic experiments have revealed the existence of delicately balanced reciprocal interactions between the Ras/MAPK and Notch/Delta signaling pathways (Fig. 3C) (18), including positive feedback leading to signal amplification, cooperative non-autonomous inhibition, and mutually antagonistic cross-talk. As the mechanisms underlying these interactions become understood in greater molecular detail, they too can be added to our picture of the network. In this fashion we are slowly building up a map of the entire regulatory network, starting with the initial inductive signals responsible for progenitor specification and moving through the genes regulated by these signals, the additional cooperating mesodermal cofactors, the cis-regulatory apparatus that integrates this information, and the downstream output of genes involved in mesodermal progenitor development. The latter will in turn provide a basis for understanding the later steps of organogenesis that lead to the formation of a mature heart and body wall musculature.


    Concluding Remarks
 
What other developmental systems might be suitable for undertaking a detailed regulatory analysis? An ideal system would be in an organism having a fully sequenced, annotated genome; sequenced sister species; available whole transcriptome microarrays, or better, whole genome microarrays; the ability to undergo genetic manipulation; a range of additional experimental techniques for perturbing gene expression such as regulated ectopic gene expression, construction of dominant-negative proteins, and RNAi or antisense RNA; defined cell culture assays; and a wide range of available cell-specific markers so that experimental outcomes could be assessed in detail. The system itself should be one for which the utilized signal transduction pathways and main tissue-specific TFs have been defined and in which at least one well-characterized transcriptional enhancer exists that can be used as a starting point for cis-element discovery. Most of these characteristics are present or are soon to be present in the genetic model organisms currently favored by developmental biologists and in the organ systems that have been their focus of study. The Drosophila eye, the C. elegans vulva, and the mammalian pituitary gland are just three among the many well-studied systems that seem well-poised to serve as foci for regulatory network discovery (5, 90, 106).

Metazoan development is a complex and highly orchestrated process in which multiple cell movements, signaling interactions, and changes in gene expression must be spatially and temporally coordinated to ensure that embryogenesis proceeds correctly. Understanding in detail each of these many interrelated processes is a daunting task. Posttranscriptional processing of RNAs and proteins plays an important role in development, and our tools for comprehensively investigating these processes remain limited. With the rise of genomic technologies, however, the necessary tools for exploring transcriptional regulation on a system-level scale are finally becoming available. By combining genetic analysis, microarray and other transcriptional profiling methods, and computational approaches to identifying and characterizing cis-regulatory sequences, we can begin not just to unravel complex genetic networks, but also to consider them holistically as part of a single integrated process, embryonic development.


    ACKNOWLEDGMENTS
 
We thank M. Boutros and Y. Grad for helpful comments on the manuscript and S. Choe for sharing the data in Fig. 2.

M. S. Halfon is supported by the Charles A. King Trust of The Medical Foundation. A. M. Michelson is an Associate Investigator of the Howard Hughes Medical Institute.


    FOOTNOTES
 
* This review was submitted in conjunction with the meeting entitled "Physiological Genomics of Cardiovascular Disease: from Technology to Physiology," in San Francisco, CA, February 2002, sponsored by the American Physiological Society.

Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).

Address for reprint requests and other correspondence: A. M. Michelson, Brigham and Women’s Hospital, Thorn 10th floor, 20 Shattuck St., Boston, MA 02115 (E-mail: michelson@rascal.med.harvard.edu).

10.1152/physiolgenomics.00072.2002.


    REFERENCES
 

  1. Adams MD et al. (Celera Genomics). The genome sequence of Drosophila melanogaster. Science 287: 2185–2195, 2000.
  2. Andrews J, Bouffard GG, Cheadle C, Lu J, Becker KG, and Oliver B. Gene discovery using computational and microarray analysis of transcription in the Drosophila melanogaster testis. Genome Res 10: 2030–2043, 2000.
  3. Arnone MI and Davidson EH. The hardwiring of development: organization and function of genomic regulatory systems. Development 124: 1851–1864, 1997.
  4. Bailey TL and Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2: 28–36, 1994.
  5. Baker NE. Master regulatory genes: telling them what to do. Bioessays 23: 763–766, 2001.
  6. Bate M and Rushton E. Myogenesis and muscle patterning in Drosophila. CR Acad Sci Paris 316: 1047–1054, 1993.
  7. Baugh LR, Hill AA, Brown EL, and Hunter CP. Quantitative analysis of mRNA amplification by in vitro transcription. Nucleic Acids Res 29: E29, 2001.
  8. Baylies MK and Michelson AM. Invertebrate myogenesis: looking back to the future of muscle development. Curr Opin Genet Dev 11: 431–439, 2001.
  9. Bergstrom DA, Penn BH, Strand A, Perry RL, Rudnicki MA, and Tapscott SJ. Promoter-specific regulation of MyoD binding and signal transduction cooperate to pattern gene expression. Mol Cell 9: 587–600, 2002.
  10. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, and Eisen MB. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA 99: 757–762, 2002.
  11. Biggin MD. To bind or not to bind. Nat Genet 28: 303–304, 2001.
  12. Bouvet P. Determination of nucleic acid recognition sequences by SELEX. In: DNA-Protein Interactions: Principles and Protocols (2nd ed.), edited by Moss T. Totowa, NJ: Humana, vol. 148, p. 603–610, 2001.
  13. Britten RJ and Davidson EH. Gene regulation for higher cells: a theory. Science 165: 349–357, 1969.
  14. Bryant Z, Subrahmanyan L, Tworoger M, LaTray L, Liu CR, Li MJ, van den Engh G, and Ruohola-Baker H. Characterization of differentially expressed genes in purified Drosophila follicle cells: toward a general strategy for cell type-specific developmental analysis. Proc Natl Acad Sci USA 96: 5559–5564, 1999.
  15. Bulyk ML, Gentalen E, Lockhart DJ, and Church GM. Quantifying DNA-protein interactions by double-stranded DNA arrays. Nat Biotechnol 17: 573–577, 1999.
  16. Bulyk ML, Huang X, Choo Y, and Church GM. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc Natl Acad Sci USA 98: 7158–7163, 2001.
  17. Bussemaker HJ, Li H, and Siggia ED. Regulatory element detection using correlation with expression. Nat Genet 27: 167–171, 2001.
  18. Carmena A, Buff E, Halfon MS, Gisselbrecht S, Jimenez F, Baylies MK, and Michelson AM. Reciprocal regulatory interactions between the Notch and Ras signaling pathways in the Drosophila embryonic mesoderm. Dev Biol 244: 226–242, 2002.
  19. Carmena A, Gisselbrecht S, Harrison J, Jiménez F, and Michelson AM. Combinatorial signaling codes for the progressive determination of cell fates in the Drosophila embryonic mesoderm. Genes Dev 12: 3910–3922, 1998.
  20. Carroll SB, Grenier JK, and Weatherbee SD. From DNA to Diversity: Molecular Genetics and the Evolution of Animal Design. Boston, MA: Blackwell Science, 2001.
  21. Churchill GA and Oliver B. Sex, flies and microarrays. Nat Genet 29: 355–356, 2001.
  22. Claverie JM. Computational methods for the identification of genes in vertebrate genomic sequences. Hum Mol Genet 6: 1735–1744, 1997.
  23. C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282: 2012–2018, 1998.
  24. D’Haeseleer P, Liang S, and Somogyi R. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 16: 707–726, 2000.
  25. Davidson EH. Genomic Regulatory Systems (1st ed.). San Diego: Academic, 2001.
  26. Davidson EH, Rast JP, Oliveri P, Ransick A, Calestani C, Yuh CH, Minokawa T, Amore G, Hinman V, Arenas-Mena C, Otim O, Brown CT, Livi CB, Lee PY, Revilla R, Rust AG, Pan Z, Schilstra MJ, Clarke PJ, Arnone MI, Rowen L, Cameron RA, McClay DR, Hood L, and Bolouri H. A genomic regulatory network for development. Science 295: 1669–1678, 2002.
  27. Davuluri RV, Grosse I, and Zhang MQ. Computational identification of promoters and first exons in the human genome. Nat Genet 29: 412–417, 2001.
  28. Dudoit S, Yang YH, Callow MJ, and Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12: 111–139, 2002.
  29. Eddy SR. Computational genomics of noncoding RNA genes. Cell 109: 137–140, 2002.
  30. Eddy SR. Noncoding RNA genes. Curr Opin Genet Dev 9: 695–699, 1999.
  31. Efron B, Tibshirani R, Storey JD, and Tusher V. Empirical Bayes Analysis of a Microarray Experiment (Technical Report no. 216). Palo Alto, CA: Department of Statistics, Stanford University, 2001.
  32. Emmert-Buck MR, Bonner RF, Smith PD, Chuaqui RF, Zhuang Z, Goldstein SR, Weiss RA, and Liotta LA. Laser capture microdissection. Science 274: 998–1001, 1996.
  33. Fickett JW and Wasserman WW. Discovery and modeling of transcriptional regulatory regions. Curr Opin Biotechnol 11: 19–24, 2000.
  34. Frasch M. Controls in patterning and diversification of somatic muscle during Drosophila embryogenesis. Curr Opin Genet Dev 9: 522–529, 1999.
  35. Frasch M. Induction of visceral and cardiac mesoderm by ectodermal Dpp in the early Drosophila embryo. Nature 374: 646–667, 1995.
  36. Frech K, Danescu-Mayer J, and Werner T. A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J Mol Biol 270: 674–687, 1997.
  37. Frith MC, Hansen U, and Weng Z. Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 17: 878–889, 2001.
  38. Furlong EE, Andersen EC, Null B, White KP, and Scott MP. Patterns of gene expression during Drosophila mesoderm development. Science 293: 1629–1633, 2001.
  39. Furlong EE, Profitt D, and Scott MP. Automated sorting of live transgenic embryos. Nat Biotechnol 19: 153–156, 2001.
  40. Gailus-Durner V, Scherf M, and Werner T. Experimental data of a single promoter can be used for in silico detection of genes with related regulation in the absence of sequence similarity. Mamm Genome 12: 67–72, 2001.
  41. Ghazi A and VijayRaghavan K. Control by combinatorial codes. Nature 408: 419–420, 2000.
  42. Gopal S, Schroeder M, Pieper U, Sczyrba A, Aytekin-Kurban G, Bekiranov S, Fajardo JE, Eswar N, Sanchez R, Sali A, and Gaasterland T. Homology-based annotation yields 1,042 new candidate genes in the Drosophila melanogaster genome. Nat Genet 27: 337–340, 2001.
  43. GuhaThakurta D and Stormo GD. Identifying target sites for cooperatively binding factors. Bioinformatics 17: 608–621, 2001.
  44. Halfon MS, Carmena A, Gisselbrecht S, Sackerson CM, Jiménez F, Baylies MK, and Michelson AM. Ras pathway specificity is determined by the integration of multiple signal-activated and tissue-restricted transcription factors. Cell 103: 63–74, 2000.
  45. Halfon MS, Grad Y, Church GM, and Michelson AM. Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome Res 12: 1019–1028, 2002.
  46. Hardison RC. Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet 16: 369–372, 2000.
  47. Hertz GZ and Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15: 563–577, 1999.
  48. Hughes JD, Estep PW, Tavazoie S, and Church GM. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 296: 1205–1214, 2000.
  49. Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner R, Goodlett DR, Aebersold R, and Hood L. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292: 929–934, 2001.
  50. Imam F, Sutherland D, Huang W, and Krasnow MA. stumps, a Drosophila gene required for fibroblast growth factor (FGF)-directed migrations of tracheal and mesodermal cells. Genetics 152: 307–318, 1999.
  51. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, and Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics In press. (Available online at http://biosun01.biostat.jhsph.edu/~ririzarr/papers/affy1.pdf).
  52. Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, and Brown PO. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409: 533–538, 2001.
  53. Jegga AG, Sherwood SP, Carman JW, Pinski AT, Phillips JL, Pestian JP, and Aronow BJ. Detection and visualization of compositionally similar cis-regulatory element clusters in orthologous and coordinately expressed genes. Genome Res In press.
  54. Jensen LJ and Knudsen S. Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics 16: 326–333, 2000.
  55. Jiang M, Ryu J, Kiraly M, Duke K, Reinke V, and Kim SK. Genome-wide analysis of developmental and sex-regulated gene expression profiles in Caenorhabditis elegans. Proc Natl Acad Sci USA 98: 218–223, 2001.
  56. Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, and Gingeras TR. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296: 916–919, 2002.
  57. Karlin S, Bergman A, and Gentles AJ. Genomics. Annotation of the Drosophila genome. Nature 411: 259–260, 2001.
  58. Kerr MK, Afshari C, Bennet L, Bushel P, Martinez J, Walker NJ, and Churchill GA. Statistical analysis of a gene expression microarray experiment with replication. Statistica Sinica 12: 203–217, 2002.
  59. Kerr MK and Churchill GA. Statistical design and the analysis of gene expression microarray data. Genet Res 77: 123–128, 2001.
  60. Kerr MK, Martin M, and Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol 7: 819–837, 2000.[ISI][Medline]
  61. Klingenhoff A, Frech K, Quandt K, and Werner T. Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity. Bioinformatics 15: 180–186, 1999.[Abstract/Free Full Text]
  62. Krivan W and Wasserman WW. A predictive model for regulatory sequences directing liver-specific transcription. Genome Res 11: 1559–1566, 2001.[Abstract/Free Full Text]
  63. Kuruvilla FG, Park PJ, and Schreiber SL. Vector algebra in the analysis of genome-wide expression data. Genome Biol 3: RESEARCH0011, 2002.
  64. Lander ES et al. (International Human Genome Sequencing Consortium). Initial sequencing and analysis of the human genome. Nature 409: 860–921, 2001.[ISI][Medline]
  65. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, and Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262: 208–214, 1993.[ISI][Medline]
  66. Lawrence PA. The Making of a Fly. Boston, MA: Blackwell Scientific, 1992.
  67. Li C and Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA 98: 31–36, 2001.
  68. Lieb JD, Liu X, Botstein D, and Brown PO. Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet 28: 327–334, 2001.
  69. Livesey FJ, Furukawa T, Steffen MA, Church GM, and Cepko CL. Microarray analysis of the transcriptional network controlled by the photoreceptor homeobox gene Crx. Curr Biol 10: 301–310, 2000.
  70. Loots GG, Ovcharenko I, Pachter L, Dubchak I, and Rubin EM. rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res 12: 832–839, 2002.
  71. Markstein M, Markstein P, Markstein V, and Levine M. Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc Natl Acad Sci USA 99: 763–768, 2002.
  72. Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS, and Dubchak I. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 16: 1046–1047, 2000.
  73. Meir E, von Dassow G, Munro E, and Odell GM. Robustness, flexibility, and the role of lateral inhibition in the neurogenic network. Curr Biol 12: 778–786, 2002.
  74. Michelson AM, Gisselbrecht S, Buff E, and Skeath JB. Heartbroken is a specific downstream mediator of FGF receptor signalling in Drosophila. Development 125: 4379–4389, 1998.
  75. Murre C, Schonleber McCaw P, and Baltimore D. A new DNA binding and dimerization motif in immunoglobulin enhancer binding, daughterless, MyoD, and myc proteins. Cell 56: 777–783, 1989.
  76. Ohler U and Niemann H. Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet 17: 56–60, 2001.
  77. Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18: 546–554, 2002.
  78. Papatsenko DA, Makeev VJ, Lifanov AP, Regnier M, Nazina AG, and Desplan C. Extraction of functional binding sites from unique regulatory regions: the Drosophila early developmental enhancers. Genome Res 12: 470–481, 2002.
  79. Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP, Nguyen BT, Norris MC, Sheehan JB, Shen N, Stern D, Stokowski RP, Thomas DJ, Trulson MO, Vyas KR, Frazer KA, Fodor SP, and Cox DR. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719–1723, 2001.
  80. Pennacchio LA and Rubin EM. Genomic strategies to identify mammalian regulatory sequences. Nat Rev Genet 2: 100–109, 2001.
  81. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, Borresen-Dale AL, Brown PO, and Botstein D. Molecular portraits of human breast tumours. Nature 406: 747–752, 2000.
  82. Pilpel Y, Sudarsanam P, and Church GM. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet 29: 153–159, 2001.
  83. Quackenbush J. Computational analysis of microarray data. Nat Rev Genet 2: 418–427, 2001.
  84. Reese MG, Hartzell G, Harris NL, Ohler U, Abril JF, and Lewis SE. Genome annotation assessment in Drosophila melanogaster. Genome Res 10: 483–501, 2000.
  85. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, and Young RA. Genome-wide location and function of DNA binding proteins. Science 290: 2306–2309, 2000.
  86. Roth FP, Hughes JD, Estep PW, and Church GM. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 16: 939–945, 1998.
  87. Ruvkun G. Molecular biology: glimpses of a tiny RNA world. Science 294: 797–799, 2001.
  88. Schmid KJ and Aquadro CF. The evolutionary analysis of "orphans" from the Drosophila genome identifies rapidly diverging and incorrectly annotated genes. Genetics 159: 589–598, 2001.
  89. Schwartz S, Zhang Z, Frazer KA, Smit A, Riemer C, Bouck J, Gibbs R, Hardison R, and Miller W. PipMaker: a web server for aligning two genomic DNA sequences. Genome Res 10: 577–586, 2000.
  90. Scully KM and Rosenfeld MG. Pituitary development: regulatory codes in mammalian organogenesis. Science 295: 2231–2235, 2002.
  91. Sherlock G. Analysis of large-scale gene expression data. Curr Opin Immunol 12: 201–205, 2000.
  92. Shoemaker DD, Schadt EE, Armour CD, He YD, Garrett-Engele P, McDonagh PD, Loerch PM, Leonardson A, Lum PY, Cavet G, Wu LF, Altschuler SJ, Edwards S, King J, Tsang JS, Schimmack G, Schelter JM, Koch J, Ziman M, Marton MJ, Li B, Cundiff P, Ward T, Castle J, Krolewski M, Meyer MR, Mao M, Burchard J, Kidd MJ, Dai H, Phillips JW, Linsley PS, Stoughton R, Scherer S, and Boguski MS. Experimental annotation of the human genome using microarray technology. Nature 409: 922–927, 2001.
  93. Shvartsman SY, Muratov CB, and Lauffenburger DA. Modeling and computational analysis of EGF receptor-mediated cell communication in Drosophila oogenesis. Development 129: 2577–2589, 2002.
  94. Sokal RR and Rohlf FJ. Biometry (3rd ed.). New York: Freeman, 1969.
  95. Staehling-Hampton K, Hoffmann FM, Baylies MK, Rushton E, and Bate M. dpp induces mesodermal gene expression in Drosophila. Nature 372: 783–786, 1994.
  96. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics 16: 16–23, 2000.
  97. Suzuki Y, Yamashita R, Nakai K, and Sugano S. DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucleic Acids Res 30: 328–331, 2002.
  98. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, and Church GM. Systematic determination of genetic network architecture. Nat Genet 22: 281–285, 1999.
  99. Thomas JG, Olson JM, Tapscott SJ, and Zhao LP. An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res 11: 1227–1236, 2001.
  100. Tusher VG, Tibshirani R, and Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98: 5116–5121, 2001.
  101. Van Gelder RN, von Zastrow ME, Yool A, Dement WC, Barchas JD, and Eberwine JH. Amplified RNA synthesized from limited quantities of heterogeneous cDNA. Proc Natl Acad Sci USA 87: 1663–1667, 1990.
  102. Velculescu VE, Zhang L, Vogelstein B, and Kinzler KW. Serial analysis of gene expression. Science 270: 484–487, 1995.
  103. Venter JC et al. (Celera Genomics). The sequence of the human genome. Science 291: 1304–1351, 2001.
  104. Vincent S, Wilson R, Coelho C, Affolter M, and Leptin M. The Drosophila protein Dof is specifically required for FGF signaling. Mol Cell 2: 515–525, 1998.
  105. von Dassow G, Meir E, Munro EM, and Odell GM. The segment polarity network is a robust developmental module. Nature 406: 188–192, 2000.
  106. Wang M and Sternberg PW. Pattern formation during C. elegans vulval induction. Curr Top Dev Biol 51: 189–220, 2001.
  107. Wasserman WW and Fickett JW. Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 278: 167–181, 1998.
  108. Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhauser R, Pruss M, Schacherer F, Thiele S, and Urbach S. The TRANSFAC system on gene expression regulation. Nucleic Acids Res 29: 281–283, 2001.
  109. Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, Afshari C, and Paules RS. Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 8: 625–637, 2001.
  110. Wu X, Golden K, and Bodmer R. Heart development in Drosophila requires the segment polarity gene wingless. Dev Biol 169: 619–628, 1995.
  111. Wyrick JJ and Young RA. Deciphering gene expression regulatory networks. Curr Opin Genet Dev 12: 130–136, 2002.
  112. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, and Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30: e15, 2002.
  113. Zhu J, Liu JS, and Lawrence CE. Bayesian adaptive sequence alignment algorithms. Bioinformatics 14: 25–39, 1998.