1 Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania
2 Center for Bioinformatics, University of Pennsylvania, Philadelphia, Pennsylvania
3 INSERM 381, Strasbourg, France
4 Genome Sequencing Center, Washington University, St. Louis, Missouri
5 Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts
6 Department of Internal Medicine, Washington University, St. Louis, Missouri
ABSTRACT
Despite recent progress in β-cell biology and diabetes research, tools for the treatment of diabetes have not changed fundamentally. Although it is now clear that islet transplantation is a valuable therapeutic approach, this solution is severely limited by the shortage of islet tissue. Over the past decade, significant advances have been made toward identifying the hierarchy of transcription factors that govern pancreatic development (1). In addition, it has been shown that embryonic stem cells can be differentiated in vitro toward insulin-producing cells, although the issue remains controversial (2–4). Despite these discoveries, major obstacles to the isolation, expansion, and differentiation of pancreatic endocrine stem and/or progenitor cells exist, including a lack of appropriate cell surface antibodies for sorting of progenitor cell populations and an only rudimentary understanding of the lineage of β-cells during development and regeneration of the pancreas.
To accelerate the progress toward the identification of endocrine precursor cells and factors that regulate the development and differentiation of β-cells, the National Institute of Diabetes and Digestive and Kidney Diseases sponsored a program entitled "Functional Genomics of the Developing Endocrine Pancreas" in 1999. The Endocrine Pancreas Consortium was created in response to this program to construct and sequence cDNA libraries derived from multiple stages of pancreatic development. Its purpose was to provide the public expressed sequence tag (EST) databases with sequences from mouse and human endocrine pancreas to discover novel transcripts that could be incorporated into custom microarrays to enhance research in diabetes and other metabolic diseases. A limited pancreas microarray or "PancChip," based on a combination of expression analysis and database mining but not on any novel cDNA libraries, has been described previously and made available to the diabetes research community (5).
Here we summarize the efforts of the Endocrine Pancreas Consortium over the past 3 years. We have constructed and sequenced 20 cDNA libraries from a variety of pancreatic sources from mice and humans, yielding >150,000 EST sequences to date. All EST clones have been submitted to the IMAGE (Integrated Molecular Analysis of Genomes and their Expression) Consortium for distribution. The sequences provided by the Endocrine Pancreas Consortium have allowed the identification of thousands of transcripts that have not yet been described in any other library, illustrating the usefulness of targeted EST projects that allow for in-depth sequencing of high-quality libraries. The Consortium ESTs were also clustered to derive nonredundant clone sets that represent >10,000 assembly groups approximating transcripts expressed in the pancreas of mice and humans. We illustrate one use of these clone sets by building a cDNA microarray that captures a vast proportion of the pancreatic and islet gene expression profile. These nonredundant pancreas clone sets derived from our cDNA libraries promise to become a useful tool for the diabetes research community.
RESEARCH DESIGN AND METHODS
cDNA libraries were constructed with the SuperScript Plasmid system kit (Invitrogen Life Technologies). To increase RT processivity and thereby increase the yield of long cDNAs, oligo-dT primed first-strand cDNA synthesis was performed at 42°C in a 20-µl reaction containing 35 units RNA Guard Porcine RNase inhibitor (Amersham Pharmacia Biotech) and 4.15 µg T4 Gene 32 protein (United States Biochemical). Remaining steps were performed according to the SuperScript kit instructions. Libraries are directional, with a NotI site in the oligo-dT primer adapter used to prime the cDNA synthesis and a SalI site at the 5' end (with the T3 and T7 promoters at the 3' and 5' ends, respectively). Double-stranded cDNA was size-fractionated over Sephacryl S-500 (Sigma), cloned into the pSPORT1 vector, and electroporated into DH10B Escherichia coli. After preliminary analysis to assess quality, the libraries were amplified once on solid support, and plasmid DNA from each library was prepared.
Library normalization was performed by method number four from Bonaldo et al. (6). Briefly, single-stranded DNA was prepared by phage F1 endonuclease (Gene II protein; Invitrogen Life Technologies) and Exonuclease III digestion and purified by hydroxyapatite chromatography. An aliquot of the single-stranded DNA was used as a template to generate PCR products representing library inserts. Single-stranded library plasmid DNA (0.5 µg) was mixed with 5 µg PCR product and hybridized to an Ecot of 6 (for N1-MMS1; 48 h) or 20 (for N4-HIS1; 160 h) at 30°C in 0.12 mol/l NaCl and 50% formamide. Ecot was calculated using a of 0.41 (7) and a formamide correction factor of 0.45 (8). Single-stranded (unhybridized) plasmids were isolated by hydroxyapatite chromatography, repaired to double-stranded DNA using random hexamers, and electroporated into DH10B E. coli.
The SMART PCR cDNA Synthesis Kit (Clontech) was used to obtain full-length cDNA from E14.5 control mRNA from wild-type mouse pancreas. A modified 3'-oligo (dT) containing a NotI site was used to prime the first-strand synthesis reaction to allow directional cloning. This full-length cDNA was used for PCR amplification to generate double-stranded cDNA. After size selection (>1 kb), SalI adapters were ligated onto both ends of the control cDNA, which was then digested with NotI and cloned into the pSPORT1 vector (LTI). The library was electroporated into DH12S cells (Invitrogen Life Technologies). A total of 22,080 clones were picked and sequenced. The same procedure was used to construct an E14.5 neurogenin 3 (Ngn3) mutant library from pancreata dissected from Ngn3-/- embryos (9), with the exception that this library was cloned into pSPORT2 (LTI). A total of 5,376 clones were picked and sequenced from this library. For the construction of the Ngn3 wild-type/mutant subtracted library, both control (target) and Ngn3-/- (driver) cDNA libraries were amplified once. The ssDNA from the control cDNA library was prepared by infecting with M13KO7 and blocked with 5'-GCGGCCGCT15 oligo before hybridization. SalI-digested double-stranded DNA from the Ngn3-/- cDNA library was used to synthesize biotinylated RNA with T7 RNA polymerase. Subtractive hybridization was performed at 42°C in 80% formamide, 100 mmol/l HEPES (pH 7.5), 2 mmol/l EDTA, and 0.2% SDS. Streptavidin was used to remove clones common between the two libraries. After repair of the subtracted ssDNA with 5'-GCGGCCGCT15 oligo using Taq polymerase, the subtracted cDNA library was electroporated into DH12S cells (LTI), and 2,880 clones were picked and sequenced.
For the human islet cDNA library (HR85), poly-A+ RNA was extracted from adult human pancreatic islets that had been isolated and purified by Dr. Barbara J. Olack (Washington University School of Medicine). cDNA was made by oligo-dT priming (Superscript Plasmid System, Life Technologies), size-selected on agarose gels, and cloned into the NotI/XhoI site of pBluescript SK (Stratagene). The 5' XhoI site was destroyed after directional cloning. The library was electroporated into DH10B cells and amplified once on solid medium. The average insert size was 1 kb.
Sequencing.
The EST sequences were generated by the Darwin EST Production Team, Genome Sequencing Center, Washington University in St. Louis School of Medicine, and submitted to the database of ESTs (dbEST).
Cluster analysis and microarray construction.
All mouse (cutoff date 19 August 2002) and human (2 October 2002) ESTs were obtained from dbEST (10), including those generated as described above. These were used along with identifiable mRNAs in GenBank (10) as part of the Database of Transcribed Sequences (DoTS) v.5 build. Identifiable mRNAs include RefSeq, RIKEN, and GenBank records of type "mRNA" or "RNA" with annotated coding sequences. Details can be found at www.allgenes.org/statistics.html. A manuscript describing the DoTS build process is in progress and will be published elsewhere. Briefly, vector sequences were detected and removed using cross_match (www.phrap.org) and the GenBank vector database. Also removed were ribosomal and mitochondrial sequences, trailing poly A and leading poly T sequences, and poor quality ends where the percentage of unknown bases in a 20-bp window exceeded 20%. Sequences were then blocked for repeats using RepeatMasker (www.phrap.org), with the relevant libraries of repeats depending on the organism used as the source for library construction. If <50 bp of informative sequence remained, then these sequences were not used further. This process eliminated 10,169 of the starting 49,829 mouse sequences and 3,989 of the starting 55,798 human sequences. The remaining blocked sequences were clustered by running a BLASTN (11) matrix with parameters of N = 10 and M = 5. Clusters were formed by a connected components analysis of all the BLASTN matches with minimum cutoff values of 92% identity and 40 bp length. Very large clusters (>10,000 members) were broken up by incrementally increasing cutoff thresholds as needed to 95% identity and 50 bp overlap, then 98% identity and 100 bp overlap. The clusters were assembled to form consensus sequences using the CAP4 algorithm packaged with the Paracel Transcript Assembler (www.paracel.com). 
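The end-trimming and minimum-length rules described above can be illustrated with a minimal Python sketch. It is not the DoTS build code: the function names are hypothetical, unknown bases are assumed to be written as `N`, and masked vector or repeat bases are assumed to be written as `X` or `N`.

```python
def trim_poor_ends(seq, window=20, max_unknown_frac=0.20):
    """Trim both ends until the terminal 20-bp window has <= 20% unknown bases."""
    seq = seq.upper()

    def is_poor(win):
        # A full-length window is poor when more than 20% of its bases are N.
        return len(win) == window and win.count("N") / window > max_unknown_frac

    start, end = 0, len(seq)
    # Advance the 5' boundary while the leading window is too ambiguous.
    while end - start >= window and is_poor(seq[start:start + window]):
        start += 1
    # Retract the 3' boundary while the trailing window is too ambiguous.
    while end - start >= window and is_poor(seq[end - window:end]):
        end -= 1
    return seq[start:end]


def informative_length(seq):
    """Count bases that are neither unknown (N) nor masked (X); assumed encoding."""
    return sum(1 for base in seq.upper() if base not in "NX")


def keep_sequence(seq, min_informative=50):
    """Keep a sequence only if at least 50 informative bases remain after trimming."""
    return informative_length(trim_poor_ends(seq)) >= min_informative
```

With the counts reported above, this kind of filtering left 49,829 - 10,169 = 39,660 mouse sequences and 55,798 - 3,989 = 51,809 human sequences for clustering.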
The resulting consensus sequences were then blocked with RepeatMasker, clustered with BLASTN (95% identity, 75 bp overlap) and incrementally assembled with CAP4 to complete the DoTS build. The assembly consensus sequences are assigned a stable DoTS identifier and subjected to a number of automated annotations including assignment of Gene Ontology (GO) functions and BLASTX query against the nonredundant database (nrdb) of protein sequences at the National Center for Biotechnology Information. Homology to the nonredundant database was used to generate Table 1. Upon completion of the DoTS build, all assemblies containing an EST from one of the Consortium cDNA libraries were identified to generate the data presented here. These assemblies were used for BLASTN analysis to dbEST for the "EST matches only" category in Table 1. Clone information was not used during the DoTS build process because errors in clone assignments lead to generating chimeric assemblies. However, clone information was used to form assembly groups to compress those DoTS assemblies containing at least one clone from the Consortium libraries to better represent transcripts.
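The clustering step can be sketched as a connected components analysis over pairwise matches, with stricter cutoffs applied only to oversized clusters. The sketch below is an illustration under stated assumptions, not the DoTS implementation: the hit tuple layout, the function names, and the recursive re-clustering strategy are hypothetical, while the cutoff values (92%/40 bp, 95%/50 bp, 98%/100 bp, and the 10,000-member limit) are those given in the text.

```python
from collections import defaultdict

# (percent identity, overlap length) cutoffs, from least to most stringent.
LEVELS = ((92.0, 40), (95.0, 50), (98.0, 100))


def connected_components(ids, hits, min_identity, min_length):
    """Group IDs linked by matches passing both cutoffs.

    hits: list of (query_id, subject_id, percent_identity, match_length).
    """
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Path-halving lookup of the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity, length in hits:
        if identity >= min_identity and length >= min_length:
            parent[find(query)] = find(subject)

    clusters = defaultdict(set)
    for seq_id in ids:
        clusters[find(seq_id)].add(seq_id)
    return sorted(sorted(members) for members in clusters.values())


def cluster_with_escalation(ids, hits, max_size=10_000, levels=LEVELS):
    """Re-cluster oversized clusters with the next, stricter cutoff pair."""
    (identity, length), stricter = levels[0], levels[1:]
    result = []
    for members in connected_components(ids, hits, identity, length):
        if len(members) > max_size and stricter:
            inside = set(members)
            sub_hits = [h for h in hits if h[0] in inside and h[1] in inside]
            result.extend(cluster_with_escalation(members, sub_hits, max_size, stricter))
        else:
            result.append(members)
    return sorted(result)
```

For example, with hits `a-b` (93%, 45 bp), `b-c` (99%, 120 bp), and `c-d` (90%, 200 bp), the default cutoffs join `a`, `b`, and `c` and leave `d` alone; lowering `max_size` to 2 forces the three-member cluster to be split at the 95%/50 bp level.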
Preparing and labeling RNA.
Ten adult CD1 male mice were killed, and the pancreata were immediately homogenized in 10 ml denaturing solution (4 mol/l guanidinium thiocyanate, 0.1 mol/l Tris-Cl, pH 7.5, 1% β-mercaptoethanol) per organ. Total RNA was extracted using the acid-phenol extraction method (12). Mouse islets were isolated after collagenase treatment and purification over a Ficoll gradient (13) and immediately homogenized in 1 ml TRIzol Reagent (Life Technologies). RNA was purified following the manufacturer's protocol with the exception that 20 µg glycogen (Roche) was added to each sample. Subsequently, RNA pellets were washed with 75% ethanol and resuspended in 300 µl TES (10 mmol/l Tris, pH 7.5, 1 mmol/l EDTA, 0.1% SDS). The RNA was re-extracted with 600 µl phenol:chloroform:isoamyl alcohol (25:24:1), precipitated with 1/10 volume 3 mol/l sodium acetate and three volumes ethanol, and stored at -80°C until use.
cDNAs were labeled with a modified indirect labeling protocol. Total RNA (20 µg) and 0.4 pmol oligo-dT21 were brought to 25 µl with diethyl pyrocarbonate-treated water and incubated for 5 min at 70°C. The RNA mixture was then cooled to 42°C. An equal volume of reaction mix (2x first-strand buffer [Invitrogen], 0.5 mmol/l dATP, 0.5 mmol/l dGTP, 0.5 mmol/l dCTP, 0.3 mmol/l dTTP, 0.2 mmol/l amino-allyl dUTP, 10 mmol/l dithiothreitol, 20 units RNasin [Promega], and 400 units Superscript II Reverse Transcriptase [Invitrogen]) was added, and the reaction was incubated for 2 h at 42°C. The reaction was terminated by bringing it to 0.202 N NaOH and 0.101 mol/l EDTA and incubating at 70°C for 10 min. The reaction was neutralized by adjusting it to 0.334 mol/l Tris-HCl, pH 7.5. After purification with a Microcon YM-30 Concentrator (Millipore), the reactions were dried in a vacuum centrifuge. The cDNA was brought up in 15 µl sterile deionized water. Cy5 or Cy3 dye (40,000 pmol each) from a CyScribe Post Labeling Kit (Amersham Pharmacia RPN5661) was brought up in 15 µl freshly prepared 0.1 mol/l Na bicarbonate buffer, pH 9.0, and added to the cDNA. The reactions were incubated for 1 h at 25°C and terminated by the addition of 15 µl hydroxylamine followed by a 15-min incubation at 25°C. The coupling reactions were combined, purified with the Qiaquick PCR Purification Kit (Qiagen), and precipitated with 1 µl polyacryl carrier (Molecular Research Center), 0.1 volume 1 mol/l Na acetate (pH 5.2), and three volumes ethanol at -20°C overnight. Following precipitation, the pellets were air-dried.
Hybridization.
In preparation for hybridization, the cDNA pellets were resuspended in 15 µl sterile deionized water. Then, 5 µl oligo-dT21 blocker (0.5 µg/µl) and 2.5 µl mouse Cot1 DNA (1 µg/µl; Invitrogen) were added to the cDNA, and the mixture was incubated at 95°C for 5 min. An equal volume of prewarmed (42°C) 2x hybridization buffer (50% formamide, 10x sodium chloride-sodium citrate [SSC], 0.2% SDS) was added, and the sample was transferred to a prehybridized glass array, covered with a coverslip (22 x 60 mm), and incubated overnight in a Corning hybridization chamber at 42°C. The coverslip was removed from the hybridized array in 2x SSC, 0.1% SDS. The arrays were then washed two times for 5 min each with agitation, once at 40°C in 0.2x SSC, 0.1% SDS, and once in 0.2x SSC at room temperature, and then dried by centrifugation in a slide rack for 3 min at 1,000 rpm.
Scanning and image analysis.
All slides were scanned immediately after hybridization and washing using a GenePix 4000B scanner. The laser power was set to 100%, and the gain of the photomultiplier tube was adjusted to avoid signal saturation in any spot. Image analysis was performed with GenePix Pro 3.0. Signal and background intensities were determined from the median pixel values. All of the array data, as well as more detailed descriptions of the methods used, are available through www.cbil.upenn.edu/EPConDB.
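As an illustration of median-based spot quantification, the following Python sketch computes a per-spot log ratio from pixel values. Only the use of median pixel values comes from the text; the background subtraction, the log2 Cy5/Cy3 ratio, and the function name are assumptions made for this example and may differ from the analysis actually applied to the array data.

```python
from math import log2
from statistics import median


def spot_log_ratio(cy5_signal, cy5_background, cy3_signal, cy3_background):
    """Return log2(Cy5/Cy3) for one spot from median pixel intensities.

    Background subtraction is an assumption of this sketch, not a step
    stated in the text. Spots with non-positive corrected intensity in
    either channel are reported as None rather than given a ratio.
    """
    red = median(cy5_signal) - median(cy5_background)
    green = median(cy3_signal) - median(cy3_background)
    if red <= 0 or green <= 0:
        return None
    return log2(red / green)
```

Using the median rather than the mean makes each spot's intensity less sensitive to a few very bright or very dark pixels, such as dust specks on the slide.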
RESULTS
Given that the pancreas contains only 2% endocrine cells, it was of course desirable to construct cDNA libraries only from pancreatic islets. This was possible for both adult mouse and human islets, where we used isolated islets as starting material. In the developing pancreas, endocrine cells and precursors are interspersed among the mesenchymal and exocrine cells; thus, the endocrine compartment cannot be isolated by physical means. Therefore, we used a genetic approach to enrich for transcripts expressed in the fetal endocrine compartment, making use of mice lacking Ngn3. In these mice, the entire endocrine compartment of the pancreas fails to develop (9,15). Thus, by constructing a subtracted library from wild-type and Ngn3 null (consisting of only mesenchymal and exocrine cells) fetal pancreas, we were able to obtain a cDNA library enriched in transcripts expressed in the endocrine compartment, including those that are only expressed transiently during development. All libraries and source tissues are detailed in Table 2.
The pancreas transcriptome.
In any cDNA sequencing project, even when using normalized and/or subtracted libraries, there is significant redundancy. To assess how many unique transcripts were represented in our cDNA libraries, and thus to obtain a first-pass description of the entire pancreatic transcriptome, we generated cDNA assemblies using a computational approach. In this process, all mouse and human sequences contained in dbEST and mRNAs from GenBank are organized as DoTS assemblies as part of the Genomics Unified Schema data system (16), accessible through AllGenes (www.allgenes.org). Each entry or assembly represents a consensus of overlapping, confirmed, and putative transcribed sequences. The complete human and mouse DoTS assemblies, derived from >4 million human and 2.5 million mouse ESTs and mRNAs, were then queried for those that contain at least one entry from Consortium libraries. The ~50,000 Consortium mouse ESTs available at the time of the last DoTS build were distributed among 13,484 assemblies (approximating unique transcripts). Clone information was used to group assemblies that did not contain overlapping sequences but were from the same transcript, reducing the number to 9,464 assembly groups. Similarly, for ~50,000 Consortium human ESTs, we obtained 20,411 assemblies and 13,910 assembly groups. Of these assembly groups, 1,821 were unique to Consortium mouse libraries, whereas 2,529 were novel in our human libraries. Most of these assembly groups consist of only one cDNA clone (1,604/1,821 for the mouse, 2,210/2,529 for the human sequences). Because most of these assemblies have significant similarity to known genes or other ESTs, it is likely that they represent rare transcripts and not genomic contamination of the libraries. Estimates of the number of transcripts expressed in a given cell range from 10,000 to 20,000. For an organ made up of multiple cell types such as the pancreas, the number of transcripts is likely to be higher. Nonetheless, the 9,464 mouse and 13,910 human assembly groups represent a major fraction of the pancreatic transcriptome.
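The compression of assemblies into assembly groups can be pictured as merging any assemblies that share a cDNA clone, for example when the 5' and 3' reads of one clone fall into two non-overlapping assemblies. The Python sketch below illustrates that idea only; the function name, the input mappings, and the identifiers in the example are hypothetical and are not taken from the DoTS system.

```python
from collections import defaultdict


def assembly_groups(est_to_assembly, est_to_clone):
    """Merge assemblies that contain ESTs derived from the same clone."""
    parent = {assembly: assembly for assembly in est_to_assembly.values()}

    def find(x):
        # Path-halving lookup of the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Collect, for every clone, the set of assemblies its ESTs landed in.
    assemblies_by_clone = defaultdict(set)
    for est, assembly in est_to_assembly.items():
        clone = est_to_clone.get(est)
        if clone is not None:
            assemblies_by_clone[clone].add(assembly)

    # Link all assemblies that share a clone.
    for assemblies in assemblies_by_clone.values():
        first, *rest = sorted(assemblies)
        for other in rest:
            parent[find(other)] = find(first)

    groups = defaultdict(set)
    for assembly in list(parent):
        groups[find(assembly)].add(assembly)
    return sorted(sorted(members) for members in groups.values())
```

In the data reported here, this kind of grouping reduced 13,484 mouse assemblies to 9,464 assembly groups and 20,411 human assemblies to 13,910 assembly groups.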
Despite the fact that millions of ESTs and cDNAs existed before we began our project, this library construction and sequencing effort has yielded several thousand candidates for novel and rare transcripts. This is due in large part to very deep sequencing of our best libraries. Whereas some apparently novel transcripts may be cloning artifacts such as genomic contaminants, assembly groups consisting of sequences from more than one clone and/or having significant homology to another gene are likely to be real transcripts. Expression analysis with our mouse pancreas cDNA microarray derived from these assembly groups also supports the notion that the vast majority of these clones are derived from true transcripts (see below).
Characterization of pancreatic transcripts.
The pancreas assembly groups described above were categorized in two ways. Using BLAST similarity to the GenBank/European Molecular Biology Laboratory/DNA Data Bank of Japan nonredundant protein database, we determined the fraction of assembly groups that came from known mouse genes, those that were known genes from other species (but not previously found in mice), and those that were only homologous to known genes and represent novel family members. In addition, we used BLAST similarity to dbEST to assess which of the remaining assembly groups had at least been found as an EST in another species versus those that are completely novel. The result of this analysis is provided in Table 1. For both the human and mouse transcriptomes, the fraction of exact matches to known genes was 25%, whereas ~25% had no match or only matched previous ESTs.
We also categorized the assembly groups according to GO functions. The GO is a database of universally accepted terms for molecular functions, biological processes, and cellular components (17). The GO also includes the hierarchical relationships between the terms. The GO functions were predicted using a method we recently developed that has an estimated accuracy of 85% (18). We were able to provide a GO function prediction for approximately half of the assembly groups containing Consortium ESTs. The percentage was much lower, however, for those assembly groups containing ESTs unique to Consortium libraries, which is not surprising given that these sequences are novel and/or rare transcripts. The breakdown of top-level GO functions is illustrated in Fig. 1. The largest groups in both the human and mouse transcriptomes were genes encoding enzymes, nucleic acid binding proteins, proteins involved in ligand binding, and proteins involved in signal transduction. These were also the top categories found when including all DoTS assemblies (www.allgenes.org/statistics.html). Examples of pancreatic transcripts that appear to be novel but for which functions can be predicted by homology to a portion of a known protein are provided in Table 3.
DISCUSSION
The discovery of known and novel genes was also consistent between the mouse and human sets. Known genes (exact matches and homologues) comprised 40–44% of the transcripts found. Transcripts that could be identified based on homology to other species constituted 30–35%. This category provides much promise because the homology to known proteins increases our ability to predict their function and, consequently, identify novel candidates for cellular processes of interest. Approximately one-quarter of the transcripts could not be matched to any known sequence or only to ESTs from other species. These will require further information from other sources (such as expression profiles from microarray analysis) to gain insight into their role in the pancreas. Recently, similar analyses of hematopoietic stem cells (20) and stromal cells (21) were published. Over half of the transcripts found in these studies were of the unknown or "EST only" category, whereas only 5–6% represented new family members. Thus, the pancreatic transcripts provide a richer resource for analysis of novel genes related to known proteins.
Sequence analysis of the Consortium libraries also revealed the surprising fact that >4,000 assembly groups only contained clones from our libraries, thus representing completely novel transcripts. These sequences represent a rich resource for future genetic and biochemical investigations. In addition, all the sequences obtained and characterized by the Consortium are valuable for further annotation of the mouse and human genomes. In many cases, alignment of our assemblies to genomic sequence revealed that there were no gene predictions matching the assembly; thus, our cDNA sequencing project will lead to an increase in the number of known genes.
The classes of gene function represented in the Consortium clone sets roughly mirror what was observed for the set of DoTS assemblies for all tissues. The consistency of the categories for mouse and human GO function assignments (Fig. 1) suggests that the observed distribution reflects the true transcript classes of normal mammalian pancreatic tissues. One striking difference between the predicted pancreatic GO functions and those predicted for all tissues is the underrepresentation of signal transduction genes in pancreatic tissues (10–15% of predicted functions) compared with all tissues (16–26%). Defense/immunity predictions were also underrepresented (1% for pancreatic vs. 4–8% for all tissues), although this is not surprising because the normal pancreas is not involved in the immune response.
Perhaps the most immediate impact of this work is the development of tissue-specific microarrays derived from our clone sets. These clone sets allow for the production of large numbers of cDNA microarrays at low cost, facilitating the use of large numbers of replicates in gene expression profiling experiments, which leads to increased data quality. Also, because many of the cDNAs contained on our microarrays are not contained on commercial platforms at present, they provide a unique tool to diabetes researchers.
ACKNOWLEDGMENTS
We are grateful to Dr. Doug Melton for his valuable input during the entire project. We also thank Brian Brunk for helping generate the DoTS build and thank members of the Computational Biology and Informatics Laboratory in the Penn Center for Bioinformatics for helpful discussions.
Address correspondence and reprint requests to Klaus H. Kaestner, Department of Genetics, University of Pennsylvania, 415 Curie Blvd., Philadelphia, PA 19104. E-mail: kaestner{at}mail.med.upenn.edu
Received for publication January 13, 2003 and accepted in revised form March 31, 2003
DoTS, Database of Transcribed Sequences; EST, expressed sequence tag; GO, Gene Ontology; nrdb, nonredundant database; SSC, sodium chloride-sodium citrate
REFERENCES