Large-scale gene discovery in human airway epithelia reveals novel transcripts
Todd E. Scheetz3,6,
Joseph Zabner2,
Michael J. Welsh2,
Justin Coco1,
Mari Eyestone1,
Maria de Fatima Bonaldo1,
Tamara Kucaba1,
Thomas L. Casavant4,5,6,
M. Bento Soares1 and
Paul B. McCray, Jr.1
1 Departments of Pediatrics
2 Internal Medicine
3 Ophthalmology
4 Electrical and Computer Engineering
5 Biomedical Engineering
6 Center for Bioinformatics and Computational Biology, Roy J. and Lucille A. Carver College of Medicine, University of Iowa, Iowa City, Iowa 52242
ABSTRACT
The airway epithelium represents an important barrier between the host and the environment. It is a first site of contact with pathogens, particulates, and other stimuli, and has evolved the means to dynamically respond to these challenges. In an effort to define the transcript profile of airway epithelia, we created and sequenced cDNA libraries from cystic fibrosis (CF) and non-CF epithelia and from human lung tissue. Sequencing of these libraries produced ~53,000 3'-expressed sequence tags (3'-ESTs). From these, a nonredundant UniGene set of more than 19,000 sequences was generated. Despite the relatively small contribution of airway epithelia to the total mass of the lung, focused gene discovery in this tissue yielded novel results. The ESTs included several thousand transcripts (6,416) not previously identified from cDNA sequences as expressed in the lung. Among the abundant transcripts were several genes involved in host defense. Most importantly, the set also included 879 3'-ESTs that appear to be novel sequences not previously represented in the National Center for Biotechnology Information UniGene collection. This UniGene set should be useful for studies of pulmonary diseases involving the airway epithelium, including cystic fibrosis, respiratory infections, and asthma. It also provides a reagent for large-scale expression profiling.
normalization; subtraction; expressed sequence tag; UniGene; cystic fibrosis
INTRODUCTION
CYSTIC FIBROSIS (CF) is caused by mutations in the cystic fibrosis transmembrane conductance regulator (CFTR), an epithelial chloride channel regulated by phosphorylation (28). Thus CF is fundamentally a disease of epithelia. Although the disease affects many organs, progressive lung disease accounts for most of the CF-associated morbidity and mortality (5). In addition to its chloride channel function, there is evidence CFTR has complex regulatory interactions with other cellular proteins. Genomics-based approaches to identify novel transcripts and study gene expression in human airway epithelia could provide new insights into epithelial function and CF disease pathogenesis. The goal of the present study was to apply methods of focused gene discovery to human airway epithelia.
The lung is composed of airway and alveolar epithelia, submucosal glands, interstitial cells, vascular tissue, smooth muscle, cartilage, neuronal tissue, and circulating and resident hematopoietic cells. Mercer et al. (15) measured the total surface area of human airways from the trachea to the bronchioles and found it to be only 0.2 m². This is a small proportion of the estimated total human alveolar surface area of 100 m² (15). Furthermore, the estimated number of cells in the airways (~1 × 10¹⁰ cells) is a minor fraction of the estimated total number of cells in the alveoli (~2 × 10¹² cells) (15). These calculations indicate that the airway epithelium represents a small portion of the total cell mass in the lung. Therefore, cDNA libraries derived from whole lung RNA may greatly underrepresent the transcripts expressed in the airway epithelium. Moreover, mRNAs from normal lungs may not completely represent the transcriptional capabilities of this highly environmentally regulated mucosal surface. For example, antimicrobial peptides such as human β-defensin-2 are normally expressed at very low levels in airway epithelia but are markedly induced under conditions associated with inflammation (8, 21). We evaluated the representation of ESTs from human lung epithelia or lung tissue in the dbEST database, particularly in the human UniGene collection developed at the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/). Consistent with the underrepresentation of airway epithelia, a search of the dbEST database found no sequences from human non-CF or CF airway epithelia or from CF lung. These results indicate that human airway epithelia and lung tissue are likely to be poorly represented in the human UniGene and EST collections. This suggests that microarray-based studies using data sets derived from the human UniGene and EST collections may not fully reflect the transcript diversity of the airway epithelium. Furthermore, this finding suggests that focused gene discovery efforts may rapidly produce comprehensive collections of ESTs from airway epithelia or lung tissue. Such approaches have been utilized in many cell and tissue types to identify a more comprehensive set of transcripts in these cells (9, 14, 18).
In the present study, three tissue sources were utilized to construct cDNA libraries: 1) primary cultures of well-differentiated non-CF epithelia grown under several conditions, 2) primary cultures of well-differentiated CF epithelia grown under several conditions, and 3) whole fetal and adult lung. Approximately 1,000 sequences were generated from each of the three initial nonnormalized libraries. These libraries were then each individually normalized, and a further 9,000–13,000 sequences were generated. Finally, a subtracted library constructed from a pool of the two epithelial libraries was generated and sequenced. The resultant comprehensive UniGene set of the cDNAs expressed in airway epithelia and lung provides a novel tool for gene discovery and expression profiling in the airway epithelium and lung and may be of broad interest for studies of CF, asthma, and other lung diseases.
MATERIALS AND METHODS
Tissue Specimens and Cell Culture
Tissue specimens.
Adult and fetal (first and second trimester) human lung was used for isolation of total RNA and library construction. Human fetal samples from tissues of the following ages were included: 9, 12, 19, and 42 wk. Adult lung from two donors was included. This study was approved by the Institutional Review Board at the University of Iowa.
Primary cultures of human airway epithelia.
Airway epithelial cells were isolated from nasal, tracheal, and bronchial tissues obtained from CF and non-CF donors. Cells were seeded onto collagen-coated, semipermeable membranes (0.6 cm² Millicell-HA; Millipore, Bedford, MA) and grown at the air-liquid interface as previously described (10). Epithelial cells were cultured in a 1:1 mixture of Dulbecco's modified Eagle's medium and Ham's F12 medium that was supplemented with 2% Ultroser G (BioSepra, Villeneuve la Garenne, France), 100 mU/ml penicillin, 100 µg/ml streptomycin, 10 µg/ml gentamicin, 25 µg/ml colimycin, 75 µg/ml ceftazidime, 25 µg/ml imipenem, 25 µg/ml cilastatin, and 2 µg/ml fluconazole. Basolateral culture medium was changed every 2–4 days. Representative samples from all epithelia preparations were evaluated for morphology using scanning electron microscopy to document the development of a ciliated apical surface. The bioelectric properties of each preparation were also characterized to verify phenotypes. All specimens were genotyped for CFTR mutations. All CF specimens used in this study were homozygous or heterozygous for the ΔF508 mutation, the most common CF-causing mutation. Samples used in the analysis were all well differentiated as determined by scanning electron microscopy and showed bioelectric properties consistent with normal epithelia or manifested the chloride transport defect characteristic of CF. All samples were cultured for >4 wk prior to use in the studies. Samples were collected with approval from the University of Iowa Institutional Review Board.
Cell culture conditions.
To prepare samples that reflect a broad range of the transcripts expressed by airway epithelia, we exposed cells to a variety of conditions (see Table 1). Because cells from non-CF epithelia were more abundant, they were treated with a greater number of conditions. The sources of the reagents are as follows: Clonetics media (BioWhittaker, Walkersville, MD), Ultroser G (BioSepra), keratinocyte growth factor (KGF; Amgen, Thousand Oaks, CA), heregulin, IL-1β, IL-6, IL-8, IL-9, and IL-13, secretory leukocyte protease inhibitor (SLPI; R & D Systems, Minneapolis, MN), dexamethasone, triiodothyronine (T3), neutrophil elastase, Escherichia coli lipopolysaccharide (LPS), Pseudomonas aeruginosa LPS, Klebsiella pneumoniae LPS (Sigma, St. Louis, MO), and adenovirus (ATCC, Manassas, VA). Pseudomonas elastase and pyocyanin were a generous gift of Dr. Charles Cox. Haemophilus influenzae strain 12 was a gift from Dr. Dwight Look. P. aeruginosa PAO1 was provided by Dr. Pete Greenberg.
mRNA Samples
Total RNA was isolated from the lung tissues and airway epithelia grown under different experimental conditions at the time points described in Table 1 using TRIzol (GIBCO-BRL) (4). The RNA isolated from each of the sources was pooled to construct separate RNA isolates for each source (fetal and adult lung tissues, non-CF epithelia, and CF epithelia). The resulting three RNA pools were poly(A) selected by oligo-dT chromatography and used for construction of three individually tagged cDNA libraries.
cDNA Libraries
Directionally cloned start (nonnormalized), normalized, and serially subtracted cDNA libraries were constructed in a plasmid vector (pT7T3-Pac) from DNase-treated poly(A)+ mRNA isolated from a number of fetal and adult lung tissues and primary cultures of human CF and non-CF airway epithelia, as previously described (3, 22). A complete list of the culture conditions used is provided in Table 1. Briefly, first-strand cDNA was primed with a poly-dT oligonucleotide (TGTTACCATTCTGATGTTGGAGCGGCCGC-N[6–10]-T[18]) that contained a NotI restriction site for directional cloning and a library tag used to identify the tissue of origin (7). Double-stranded cDNA was ligated to EcoRI adaptors (5'-AATTGGCACGAGG-3', 3'-GCCGTGCTCC-5'), digested with NotI, and directionally cloned into pT7T3-Pac.
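As a rough illustration of how a library tag might be read back from a sequence, the sketch below searches for the NotI site followed by a short tag and a poly-T run, mirroring the primer layout given above. The read orientation (primer strand), the tag length of 6 to 10 nt, the minimum poly-T length, and the tag-to-tissue table are all assumptions made for this example; the actual tag sequences are not listed here.

```python
import re

# Hypothetical tag-to-tissue table; the real tag sequences are not given in the text.
TAG_TO_TISSUE = {
    "ACGTAC": "non-CF epithelia",
    "GATCGA": "CF epithelia",
    "CTAGCTAG": "lung",
}

# NotI site, then a short tag (lazy match), then a poly-T run.
# Primer-strand orientation and a minimum of 12 Ts are assumptions.
TAG_PATTERN = re.compile(r"GCGGCCGC([ACGT]{6,10}?)(T{12,})")


def assign_tissue(read):
    """Return the tissue for the first recognizable library tag, or None."""
    match = TAG_PATTERN.search(read.upper())
    if match is None:
        return None
    # Unknown tags also yield None, so the read stays unassigned.
    return TAG_TO_TISSUE.get(match.group(1))
```

Reads without a recognizable tag return None, which corresponds to clusters whose tissue source cannot be determined.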
Sequencing
Dideoxy terminator sequencing was performed in 96-well format by cycle sequencing using dRhodamine dye terminator chemistry (Applied Biosystems, Foster City, CA). After thermal cycling, sequencing reactions were ethanol precipitated, resuspended in loading buffer containing formamide, denatured, and analyzed on an ABI377 or an ABI3700 capillary sequencer. A detailed description of the sequencing protocol is available online at the Univ. of Iowa Rat EST Project web page (http://ratest.eng.uiowa.edu/localdocs/sequencing_protocol.html).
After data capture on the ABI sequencers, the gels were tracked (if necessary) and transferred to a centralized server. From there, the sequences were processed as outlined below and placed into a file-system hierarchy. Nucleotide sequences and per-base quality values were extracted from the ABI-generated chromatograph files (SCF files) using the phred base-calling program (6). All of the sequences generated as a part of this research were submitted to dbEST and incorporated into the human UniGene data set.
Feature Identification and Quality Assessment
Expected EST features and overall sequence quality were assessed using ESTprep (19) and RepeatMasker (A. Smit and P. Green, unpublished data), as described in Scheetz and Casavant (17). Briefly, the features detected include vector and cloning site sequence, polyadenylation tail and signal, and potential contaminating sequences (bacterial, mitochondrial, vector). In addition, the library tag (as described in cDNA Libraries) is also identified, allowing discrimination of tissue source from a pooled cDNA library. The quality assessment protocol requires that several additional criteria be satisfied: an overall sequence quality (in phred q scores) greater than 25, more than 50% of the sequence (in nt) over q20, and a quality-trimmed EST insert length of more than 100 bp.
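The three quality criteria can be expressed as a simple filter. The following is a minimal sketch, not the ESTprep implementation: it assumes that "overall sequence quality" means the mean phred score across the read, that "over q20" means a score of at least 20, and that the trimmed insert length is supplied by an upstream step. The function and parameter names are hypothetical.

```python
def passes_quality(quals, insert_length, min_mean_q=25, q_cutoff=20,
                   min_fraction=0.5, min_insert=100):
    """Apply the three quality criteria to one EST.

    quals: per-base phred scores for the read.
    insert_length: quality-trimmed insert length in bp (computed elsewhere).
    """
    if not quals:
        return False
    mean_q = sum(quals) / len(quals)
    # Fraction of bases at or above the cutoff (assumed reading of "over q20").
    fraction_high = sum(1 for q in quals if q >= q_cutoff) / len(quals)
    return (mean_q > min_mean_q
            and fraction_high > min_fraction
            and insert_length > min_insert)
```

Keeping the thresholds as keyword arguments makes it easy to test how sensitive the accepted set is to each criterion.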
Clustering
Local clustering of the ESTs was performed using the UIcluster program (v3.0.5) (25). Default parameters were used, with the addition of allowing matching on both forward and reverse complement. This allowed rapid and robust novelty assessment of the ESTs generated in this project, an important component of the subtractive cDNA sequencing process. The human UniGene set (Ref. 20; ftp://ftp.ncbi.nih.gov/repository/UniGene) was also used to further evaluate the novelty of the ESTs.
BLAST Analysis
BLAST-based sequence similarity was used to compare a representative element from each cluster against the nonredundant nucleotide database, dbEST, and the Affymetrix consensus sequences used to design the oligos. These sequences were obtained from the NCBI and Affymetrix web sites. A significance criterion of at least 100 bp and 90% identity was used in the BLAST analysis.
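The significance criterion can be applied as a post-filter on BLAST output. The sketch below assumes the hits are available as tab-separated lines in the standard 12-column tabular layout (query, subject, percent identity, alignment length, and so on); the output format actually used in this study is not stated, and the function names are hypothetical.

```python
def significant_hits(tabular_lines, min_length=100, min_identity=90.0):
    """Keep hits spanning at least min_length bp at min_identity percent or more."""
    hits = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        if length >= min_length and identity >= min_identity:
            hits.append((query, subject, identity, length))
    return hits


def clusters_without_hits(cluster_ids, hits):
    """Return the cluster representatives that have no significant hit."""
    matched = {hit[0] for hit in hits}
    return [cid for cid in cluster_ids if cid not in matched]
```

Cluster representatives returned by `clusters_without_hits` are the candidates for novelty with respect to the database searched.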
Assessment of Genomic Localization
The 3' and 5' sequences for each clone were aligned to the human genome (June 2003 release) using the BLAT alignment tool. A comparison of localization and orientation was made between the EST alignments of each clone and known genes and mRNAs in the Univ. of California Santa Cruz (UCSC) genome browser (http://genome.ucsc.edu). Assessment of novelty vs. in silico gene predictions was performed using the GenScan track on the UCSC site. The 3' ESTs with a poly-A tail should align in the opposite orientation with respect to the known transcript. Sequences that did not overlap a known mRNA sequence and showed evidence of untemplated polyadenylation were considered novel in this analysis. The quality of the BLAT hits was assessed by their alignment scores. The distances reported are the minimum distance between the mRNA and EST alignment locations. In cases where the clone sequences are partially or completely contained within an mRNA, the clone was placed in the "overlap" category.
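The distance-based categorization can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually run: alignments are given as half-open (start, end) intervals on the same chromosome, orientation has already been checked, the bin boundaries (1, 5, and 10 kb) follow the distance categories reported in the Results, and a clone with no mRNA on its chromosome is placed in the most distant bin. The function names are hypothetical.

```python
def gap_between(est, mrna):
    """Signed gap between two half-open intervals; negative means overlap."""
    return max(est[0], mrna[0]) - min(est[1], mrna[1])


def classify_clone(est, mrnas):
    """Assign a clone alignment to a distance category relative to known mRNAs."""
    if not mrnas:
        return ">10 kb"  # assumption: nothing nearby on this chromosome
    nearest = min(gap_between(est, mrna) for mrna in mrnas)
    if nearest < 0:
        return "overlap"  # partially or completely contained in an mRNA
    if nearest < 1000:
        return "<1 kb"
    if nearest < 5000:
        return "1-5 kb"
    if nearest <= 10000:
        return "5-10 kb"
    return ">10 kb"
```

Taking the minimum over all mRNAs on the chromosome matches the statement that the reported distance is the minimum distance between the mRNA and EST alignment locations.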
ORF Analysis
The 3' and 5' sequences were assembled and translated in all three frames. These assemblies were then blasted against the nonredundant amino acid database from NCBI. Sequences with hits with an E-value less than 0.01 were manually inspected to assess the identity of the hit.
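For readers unfamiliar with the translation step, a minimal three-frame translation using the standard genetic code might look like the sketch below. The function name is hypothetical, codons containing ambiguous bases are translated as "X" by assumption, and the subsequent protein database search is not shown.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered T, C, A, G at each position.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_three_frames(seq):
    """Translate a nucleotide sequence in forward frames 0, 1, and 2."""
    seq = seq.upper()
    frames = []
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        # Unknown codons (e.g., containing N) become X.
        frames.append("".join(CODON_TABLE.get(codon, "X") for codon in codons))
    return frames
```

Stop codons appear as "*" in the output, so candidate open reading frames are the stretches between them.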
RESULTS
Library Construction and Sequencing
A total of seven cDNA libraries were created using mRNA isolated from airway epithelia and lung tissue, as described in MATERIALS AND METHODS. Three initial single-tissue libraries were constructed (normal fetal and adult lung, non-CF epithelia, and CF epithelia). To increase the gene discovery rate, three additional normalized cDNA libraries were made from the three starting libraries. Finally, a subtracted library was made from a pool of the two epithelia-derived libraries. From these cDNA libraries, a total of 52,980 3'-EST sequences were generated. All ESTs that passed our quality criteria (as outlined in MATERIALS AND METHODS) were deposited in the dbEST division of GenBank and incorporated into the human UniGene data set. The clones from these libraries ranged from 0.35 to 2.5 kb in length. Unlike full-length sequencing projects (23), only end-sequencing was performed for the purpose of identifying novel 3' ESTs.
The ESTs generated are expected to consist primarily of untranslated sequence (UTR). In a comparison to annotated human mRNAs, ESTs from 2,228 of the 19,059 clusters aligned to one of the mRNAs, and 1,088 extended into the CDS. Thus we expect that slightly less than half of the ESTs with a polyadenylation tail and signal will contain coding sequence. This same analysis estimated an average UTR length of 772 bp.
Transcript Profiles of Nonnormalized Libraries
Although normalized and subtracted libraries are excellent for efficiently identifying a comprehensive set of mRNA transcripts, these cannot be utilized to infer an expression profile. Therefore, ~1,000 clones were sequenced from each of the three initial nonnormalized libraries. The 20 most frequently sequenced transcripts from each of the three nonnormalized libraries are presented in Tables 2–4. The commonly sequenced epithelial transcripts included several gene products previously recognized for their roles in mucosal host defense. In non-CF epithelia (Table 2), these included the polymeric immunoglobulin receptor, IL-8, and β2-microglobulin. Epithelial cytoskeletal and adhesion-related gene products sequenced included β1-integrin and annexin A1. In addition, many genes with "housekeeping functions" were identified including ribosomal subunit RNAs, chaperones, β-actin, and cellular enzymes. The CF epithelial library (Table 3) shared many transcripts with the non-CF library. The 3' ESTs frequently sequenced from CF epithelia included keratin 19, CD74, cathepsin D, MEN1, and properdin B-factor. In addition, two abundant transcripts of epithelial origin sequenced more frequently in the CF library were the human homolog of the mouse palate, lung, and nasal epithelium clone "PLUNC" (also termed LUNX or SPLUNC1) and the von Ebner minor salivary gland protein (also termed LPLUNC1) (2). The most commonly sequenced genes from the lung library (Table 4) included surfactant protein C, surfactant protein A1, and many "housekeeping" genes.
Gene Discovery
Figure 1 presents the incremental novelty resulting from the sequencing effort using the examples of the three normalized libraries and the subtracted library from airway epithelia. Here, the benefits of cDNA library normalization and subtraction are evident. For example, as sequencing of each normalized library began (CF and non-CF airway epithelia, whole lung), the novelty rate of the output increased substantially. Similarly, as the subtracted library of pooled normalized epithelial ESTs was sequenced, the novelty rate again increased. A typical strategy in gene discovery projects utilizes sequence-homology based clustering to estimate the number of unique genes identified by a set of ESTs. Often polyadenylation tail and signal are also utilized as inclusion criteria to ensure that only those sequences representing bona fide 3' ends of transcripts contribute to the final number of genes. This helps avoid counting nonoverlapping "islands" of ESTs derived from the same gene as representing different transcripts.

Fig. 1. Graphical representation of gene discovery novelty over the course of sequencing of 3' expressed sequence tags (ESTs) in four libraries. Results are from sequencing of cDNA libraries from cystic fibrosis (CF) epithelia (nonnormalized and normalized); non-CF epithelia (nonnormalized and normalized); pooled lung (nonnormalized and normalized); and a subtracted cDNA library derived from a pooled version of the normalized non-CF and CF epithelia. The total number of sequences generated from the CF epithelia, non-CF epithelia, and pooled lung libraries were 9,633, 12,299, and 12,914, respectively.
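One way to compute a novelty curve of the kind plotted in Fig. 1 is sketched below. It assumes that novelty is measured as the fraction of ESTs in each successive window that found a new cluster, that cluster assignments are listed in sequencing order, and that a window of 1,000 ESTs is used; none of these details is specified in the text, and the function name is hypothetical.

```python
def novelty_by_window(cluster_ids, window=1000):
    """Fraction of ESTs per window that belong to a not-yet-seen cluster.

    cluster_ids: cluster assignment of each EST, in sequencing order.
    """
    seen = set()
    rates = []
    for start in range(0, len(cluster_ids), window):
        chunk = cluster_ids[start:start + window]
        new_clusters = 0
        for cid in chunk:
            if cid not in seen:
                seen.add(cid)
                new_clusters += 1
        rates.append(new_clusters / len(chunk))
    return rates
```

A drop in this rate signals diminishing returns from a library, which is the point at which moving to a normalized or subtracted library would be expected to raise the rate again.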
Of the 15,227 UniGene clusters contributed to by this sequencing effort, our ESTs were the first evidence of lung expression for 6,416 clusters. Of these, 1,426 clusters contained ESTs only from our cDNA libraries (i.e., were uniquely discovered by UI ESTs). Importantly, 429 of the UI-discovered clusters contained a polyadenylation tail and canonical signal, and another 190 contained a polyadenylation tail and an alternative polyadenylation signal (1). The ESTs were also clustered locally based upon sequence similarity using UIcluster, resulting in a nonredundant set of 19,059 clusters. The majority of the difference between the UI-derived clustering (19,059) and the NCBI UniGene clustering (15,227) was caused by the requirement for evidence of polyadenylation in the UniGene set. The UIcluster procedure does not share this requirement, and thus it is expected that UIcluster would generate a larger number of clusters. Although the gross number of clusters differed between the UniGene sets derived by UIcluster and NCBI, the cluster composition was very similar. Previously reported results demonstrated 2.55% difference in cluster composition between the two strategies (26).
A graphical representation of the relative discovery from the tissue sources utilized in building the cDNA libraries is presented in Fig. 2. Each of the circles in the Venn diagram presents the number of clusters containing at least one sequence from that tissue. It is important to note that for 660 clusters the tissue source could not be determined, and these were therefore not included in Fig. 2. The places where the circles overlap denote clusters with sequences derived from two or more of the tissues. From Fig. 2, several points can be made. First, of the 19,059 clusters identified, 1,932 contain messages common to all three tissues. Second, each tissue uniquely contributes a few thousand clusters. Thus each of the starting tissues contributed to the overall gene discovery process. Within the UniGene build, 2,014 of the clusters were lung specific (i.e., composed only of ESTs derived from lung tissues) and 1,190 were epithelia specific. We observed a substantially higher number of unique sequences contributed from the non-CF epithelial library (5,686 3'-ESTs) than from the CF epithelial library (3,488 3'-ESTs). This result was expected, as the non-CF epithelia samples were treated with many more conditions designed to induce gene expression (Table 1). Consistent with our prediction that the airway epithelial transcript profile is highly plastic and regulated by environmental and nonenvironmental factors, both epithelial libraries contributed substantially to the collection of 19,059 clusters.

Fig. 2. Venn diagram representing the relative contribution to the UniGene collection from non-CF epithelia, CF epithelia, and lung tissue. (Note: 660 clusters lacked unambiguous library tag identification.)
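The tally behind a diagram such as Fig. 2 can be sketched as follows, assuming each cluster has already been mapped to the set of tissues that contributed at least one sequence to it. The input structure, tissue labels, and function name are hypothetical; clusters with no assignable tissue are skipped, as was done for the clusters lacking an unambiguous library tag.

```python
from collections import Counter


def venn_counts(cluster_tissues):
    """Count clusters in each Venn region.

    cluster_tissues: dict mapping cluster id to a set of tissue labels.
    Returns a Counter keyed by frozenset of tissues (one key per region).
    """
    counts = Counter()
    for tissues in cluster_tissues.values():
        if tissues:  # skip clusters whose tissue source is undetermined
            counts[frozenset(tissues)] += 1
    return counts
```

The region keyed by all three tissues corresponds to the clusters common to every source, and single-tissue keys correspond to tissue-specific clusters.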
Epithelial Specificity of Transcripts Sequenced
One question of immediate interest was how well the known epithelial-specific genes were represented in this UniGene set. Table 5 presents a selected list of categories of genes of epithelial origin that are represented in the libraries. For example, the CFTR mRNA, a transcript known to be expressed in airway epithelia at the level of approximately one to two copies/cell (24) was identified. Several other examples, representing gene families involved in ion transport, receptor function, cytokines, host defense, signaling, enzymatic functions, and intracellular adhesion are also shown.
Novelty Assessment
Figure 3 illustrates the contributions of each of the starting tissue sources to the discovery of novel ESTs. Of the 1,426 UniGene clusters described above as only represented from our cDNA libraries, 1,366 had unambiguous library tags allowing tissue assignment. The relative contributions of the three tissue sources to the novel transcript discovery reflect the diversity illustrated in Fig. 2, with most novel sequences originating from the epithelial libraries.

Fig. 3. Venn diagram illustrating the contributions of each of the starting tissue sources to the discovery of novel ESTs. Of 1,426 UniGene clusters unique to our cDNA libraries, 1,366 had unambiguous library tags allowing assignment. The relative contributions of the three tissue sources to the novelty illustrated in Fig. 2 are shown.
Another measure of novelty is the comparison of the identified set of 19,059 human airway epithelial and lung clusters with the probes on the Affymetrix U133 human GeneChip set (http://www.Affymetrix.com). A total of 5,961 of the airway and lung clusters identified lacked any significant homology to the consensus sequences utilized to derive the individual probe sequences. Of these, the most significant are those with a polyadenylation tail and signal (canonical or alternative). These sequences likely represent novel transcripts and/or alternative 3' ends of transcripts already present in the Affymetrix U133 set. Because of the significant length bias (typical labeled cRNA ~600 bp) during the labeling reaction, labeled targets derived from these unrepresented (or poorly represented) transcripts are unlikely to hybridize with the Affymetrix GeneChip probe sets. In other words, the limited length of labeled targets implies that probes not specifically designed for the prevalent lung transcripts are unlikely to hybridize. Therefore, it would be difficult to use transcript profiling with current commercial arrays to investigate the importance of these transcripts in the development, progression, or treatment of CF or other lung diseases.
From the complete set of 3' EST sequences submitted to GenBank, 3,168 were not selected for inclusion within the current UniGene build. Although these 3,168 ESTs were not represented within the current UniGene build, they were available for incorporation into UniGene and were included within the local clustering. These sequences defined a set of 879 clusters comprising only sequences not included in the NCBI UniGene set.
A representative clone was selected from each of the 879 clusters not included in the UniGene set. These clones were resequenced from the 3' and 5' ends to further assess their novelty. A total of 491 of these clones were further validated as novel based upon the lack of a significant BLAST hit (other than themselves) to a database of all human ESTs in dbEST. Those with a weak BLAST hit (less than 90% identity over 100 nt) are probably homologs of known genes. Those ESTs lacking any BLAST hit likely represent either novel transcripts or previously unobserved 3' ends. A final sequence composition analysis was applied to these 491 clones, identifying 199 clones in which the 3' EST contained a polyadenylation tail and signal (canonical or alternative). As mentioned above, those sequences lacking a polyadenylation tail and/or signal likely represent internal sequence for previously discovered but incompletely characterized/sequenced genes. Both the 3' and 5' sequences from these 199 clones were aligned to the human genome using BLAT (11). Of these 199 sequences, 134 were determined to be the result of untemplated polyadenylation based upon alignment to the human genome (i.e., the homology to the genomic sequence does not extend into the polyadenylation tail). The genomic location and context were then evaluated using the UCSC genome browser (http://genome.ucsc.edu/; Ref. 12). Specifically, the sequences were evaluated to determine whether they were associated with novel forms of known transcripts or represented potentially novel transcripts.
Only transcripts in the proper orientation were considered in this analysis, the results of which are presented in Fig. 4. Of these re-arrayed clones, 61 overlapped at least partially with previously reported transcripts, and another 5 fell within 1 kb of known genes. These clones most likely represent novel 3' ends of previously identified transcripts. Another 21 ESTs mapped between 1 and 10 kb from known transcripts. The identity of this group of sequences is more challenging to definitively classify. Because they were found to lie further from known transcripts, the probability that they are derived from a different (novel) transcript increases. It is likely that some of the 16 clones localizing within 1–5 kb of a known gene represent products of different transcriptional units. However, the majority may represent alternative 3' splicing and/or polyadenylation events. The five clones that localize even further (5–10 kb) from neighboring transcripts are more likely to represent novel transcripts, rather than additional 3' sequence for known transcripts. Of special note are the 47 transcripts that did not localize within 10 kb of any reported human mRNA sequence. It is quite likely that these clones represent previously unidentified transcripts. They may be low-abundance transcripts or may be specific to lung epithelia. A list of these transcripts is available in the online data supplement (Supplemental Table 6, available at the Physiological Genomics web site).1

Fig. 4. Further characterization of 3' ESTs not represented in the human UniGene set. The clones were re-arrayed and sequenced from the 5' and 3' ends. The distribution of 134 of the 3' and 5' EST pair alignments in the human genome are presented in the context of their relationship to known genes and mRNAs in the UCSC genome database. Sequences are defined by their distance from the nearest known gene.
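The counts reported in this part of the analysis can be tallied as a simple consistency check. The sketch below uses only numbers stated in the text; the step and bin labels are shorthand chosen for this example, and the helper name is hypothetical.

```python
# Successive filtering steps and the number of clones retained at each.
FUNNEL = [
    ("clusters not in UniGene", 879),
    ("no significant dbEST hit", 491),
    ("poly(A) tail and signal", 199),
    ("untemplated polyadenylation", 134),
]

# Distance categories for the 134 clones, relative to known genes and mRNAs.
DISTANCE_BINS = {
    "overlap": 61,
    "<1 kb": 5,
    "1-5 kb": 16,
    "5-10 kb": 5,
    ">10 kb": 47,
}


def retention(funnel):
    """Fraction of clones retained at each step relative to the previous step."""
    return [
        (funnel[i][0], funnel[i][1] / funnel[i - 1][1])
        for i in range(1, len(funnel))
    ]


# The distance categories account for every clone in the last funnel step.
assert sum(DISTANCE_BINS.values()) == FUNNEL[-1][1]
```

The two intermediate bins (16 and 5 clones) together give the 21 ESTs described as mapping between 1 and 10 kb from known transcripts.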
To further assess the novelty of the 47 sequences determined to be the most novel, two additional analyses were performed. The first was based on open-reading frame (ORF) analysis, to assess similarity to known genes, and the second was a comparison to the in silico predicted gene structures. The only homologies found to the ORFs identified from the ESTs were to Alu sequences (4 identified), and hypothetical proteins (3 identified; best E-value e-07). Of the 47 novel sequences, 10 fell within annotated in silico predicted introns, i.e., they did not align within predicted exons (transcribed sequences). One interpretation of this result is that these ESTs represent 3' ends of these in silico predicted genes. Two more novel sequences fell near in silico predicted genes, and one partially overlapped with an in silico predicted exon.
DISCUSSION
This study demonstrates the utility of a focused gene discovery effort in identifying transcripts of the lung and specifically the surface airway epithelium. Application of the combined approaches of cDNA library normalization and subtraction facilitated efficient gene discovery in airway epithelia. This approach has been effective in numerous gene discovery projects (see http://genome.uiowa.edu/clcg.html; Ref. 3). Importantly, the resultant UniGene set consists of 19,059 nonredundant cDNAs, including 879 not represented in the current NCBI UniGene build no. 161.
From these 879 clusters, 80 were eventually determined to have a high probability of representing novel transcripts. None of the novel ESTs identified were included in previous human genome annotations, indicating that earlier annotation efforts missed them. These findings demonstrate how focused cell- or tissue-specific gene discovery may reveal novel alternative transcripts of known genes and identify many new genes. They also call into question current estimates of ~25,000–30,000 genes in the human genome (13, 16, 27). The functions of these transcripts are unknown at present.
Of significant interest from the sequencing of nonnormalized libraries were the contrasts among the transcripts derived from epithelia and those from the lung (Tables 2–4). The abundant transcripts from the lung libraries included many sequences recognized for their "housekeeping" functions. In contrast, the more frequently sequenced epithelial ESTs included antimicrobial proteins, cytokines, immunoregulatory genes, and genes involved in cellular metabolism. This is consistent with the role of the airway epithelium as an important interface between the host and the environment. The observation that some sequences were more frequently identified from the CF libraries than from the non-CF libraries (i.e., keratin 19, LPLUNC1, and PLUNC) may merely reflect the greater number of treatments applied to the non-CF epithelia (see Table 1) and should be further investigated in additional studies.
These results confirm and emphasize the potential yields from focused gene discovery efforts in specific underrepresented cells and tissues and the value of in vitro manipulation of the cells prior to isolating input RNA for library construction and gene discovery. Our findings are consistent with previous findings in other organisms and tissues [human (9), mouse (14), rat (18)].
In summary, we generated a UniGene collection comprising more than 19,000 transcripts expressed in human airway epithelia and lung, including many novel transcripts and hundreds of sequences not represented on commercial arrays. This gene collection may have broad applications for gene discovery and will be useful for large-scale expression analysis for investigators interested in lung diseases.
GRANTS
This work was supported by a Functional Genomics Research Center Grant from the Cystic Fibrosis Foundation (McCray00V0). We also acknowledge the support of the In Vitro Models and the Cell Culture Core and the Cell Morphology Core, partially supported by the Cystic Fibrosis Foundation, National Institutes of Health (NIH) Grants HL-51670 and HL-61234, and by the Center for Gene Therapy for Cystic Fibrosis (NIH Grant P30-DK-54759). Fetal lung tissues were obtained from the Central Laboratory for Human Embryology at the University of Washington. M. J. Welsh is an investigator of the Howard Hughes Medical Institute.
ACKNOWLEDGMENTS
We are grateful for the contributions of Phil Karp and Pary Weber for culturing the human epithelial cells. We acknowledge the assistance and support of the UI CLCG sequence processing group including Brian O'Leary, Michael Smith, Christopher Moressi, Barry Gackle, Brian Mokrzycki, Dylan Tack, and A. Jason Grundstad. We acknowledge the assistance of Keith Crouch, Hana Itani, Mindee Perdue, and Kelly Schaefer with template preparation. We appreciate the assistance of Kurtis Trout with clone arraying, re-arraying, and replicating. Catherine Keppel, Mark Lebeck, and Christina Smith provided assistance with sequencing.
The complete set of 19,059 nonredundant lung and epithelial expressed clones is available from Open Biosystems (http://www.openbiosystems.org).
FOOTNOTES
Article published online before print. See web site for date of publication (http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: P. B. McCray, Jr., Dept. of Pediatrics, 240-G EMRB, Univ. of Iowa College of Medicine, Iowa City, IA 52242 (E-mail: paul-mccray{at}uiowa.edu).
10.1152/physiolgenomics.00188.2003.
1 The Supplementary Material for this article (Supplementary Table 6, a list of transcripts) is available online at http://physiolgenomics.physiology.org/cgi/content/full/00188.2003/DC1. 
REFERENCES
- Beaudoing E, Freier S, Wyatt JR, Claverie JM, and Gautheret D. Patterns of variant polyadenylation signal usage in human genes. Genome Res 10: 1001–1010, 2000.
- Bingle CD and Craven CJ. PLUNC: a novel family of candidate host defence proteins expressed in the upper airways and nasopharynx. Hum Mol Genet 11: 937–943, 2002.
- Bonaldo MF, Lennon G, and Soares MB. Normalization and subtraction: two approaches to facilitate gene discovery. Genome Res 6: 791–806, 1996.
- Chomczynski P and Sacchi N. Single-step method of RNA isolation by acid guanidinium thiocyanate-phenol-chloroform extraction. Anal Biochem 162: 156–159, 1987.
- Davis PB. Cystic fibrosis. Pediatr Rev 22: 257–264, 2001.
- Ewing B, Hillier L, Wendl MC, and Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8: 175–185, 1998.
- Gavin AJ, Scheetz TE, Roberts CA, O'Leary B, Braun TA, Sheffield VC, Soares MB, Robinson JP, and Casavant TL. Pooled library tissue tags for EST-based gene discovery. Bioinformatics 18: 1162–1166, 2002.
- Harder J, Meyer-Hoffert U, Teran LM, Schwichtenberg L, Bartels J, Maune S, and Schroder JM. Mucoid Pseudomonas aeruginosa, TNF-alpha, and IL-1beta, but not IL-6, induce human beta-defensin-2 in respiratory epithelia. Am J Respir Cell Mol Biol 22: 714–721, 2000.
- Hillier LD, Lennon G, Becker M, Bonaldo MF, Chiapelli B, Chissoe S, Dietrich N, DuBuque T, Favello A, Gish W, Hawkins M, Hultman M, Kucaba T, Lacy M, Le M, Le N, Mardis E, Moore B, Morris M, Parsons J, Prange C, Rifkin L, Rohlfing T, Schellenberg K, and Marra M. Generation and analysis of 280,000 human expressed sequence tags. Genome Res 6: 807–828, 1996.
- Karp PH, Moninger TO, Weber SP, Nesselhauf TS, Launspach JL, Zabner J, and Welsh MJ. An in vitro model of differentiated human airway epithelia. Methods for establishing primary cultures. Methods Mol Biol 188: 115–137, 2002.
- Kent WJ. BLAT: the BLAST-like alignment tool. Genome Res 12: 656–664, 2002.
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, and Haussler D. The human genome browser at UCSC. Genome Res 12: 996–1006, 2002.
- Lander ES et al. (International Human Genome Sequencing Consortium). Initial sequencing and analysis of the human genome. Nature 409: 860–921, 2001.
- Marra M, Hillier L, Kucaba T, Allen M, Barstead R, Beck C, Blistain A, Bonaldo M, Bowers Y, Bowles L, Cardenas M, Chamberlain A, Chappell J, Clifton S, Favello A, Geisel S, Gibbons M, Harvey N, Hill F, Jackson Y, Kohn S, Lennon G, Mardis E, Martin J, Mila L, McCann R, Morales R, Pape D, Person B, Prange C, Ritter E, Soares M, Schurk R, Shin T, Steptoe M, Swaller T, Theising B, Underwood K, Wylie T, Yount T, Wilson R, and Waterston R. An encyclopedia of mouse genes. Nat Genet 21: 191–194, 1999.
- Mercer RR, Russell ML, Roggli VL, and Crapo JD. Cell number and distribution in human and rat airways. Am J Respir Cell Mol Biol 10: 613–624, 1994.
- Pennisi E. Bioinformatics. Gene counters struggle to get the right answer. Science 301: 1040–1041, 2003.
- Scheetz TE and Casavant TL. Informatics for efficient EST-based gene discovery in normalized and subtracted cDNA libraries. In: The Practical Bioinformatician, edited by Wong L. River Edge, NJ: World Scientific, 2004.
- Scheetz TE, Laffin JL, Berger B, Mackerly S, Baumes SA, Brown IIR, Chang S, Coco J, Conklin J, Crouch K, Donohue M, Doonan G, Estes C, Eyestone M, Fishler K, Gardiner J, Guo L, Johnson B, Keppel C, Kreger R, Lebeck M, Marcelino R, Miljkovich V, Perdue M, Qui L, Rehmann J, Reiter RS, Rhoads B, Schaefer K, Smith C, Sunjevaric I, Trout K, Wu N, Birkett CL, Bischof J, Gackle B, Gavin A, Grundstad AJ, Mokrzycki B, Moressi C, O'Leary B, Pedretti K, Roberts CA, Robinson NL, Smith M, Tack D, Trivedi N, Kucaba T, Freeman T, Lin J, Bonaldo MF, Casavant TL, Sheffield VC, and Soares MB. High-throughput gene discovery in the rat. Genome Res In press, 2004.
- Scheetz TE, Trivedi N, Roberts CA, Kucaba T, Berger B, Robinson NL, Birkett CL, Gavin AJ, O'Leary B, Braun TA, Bonaldo MF, Robinson JP, Sheffield VC, Soares MB, and Casavant TL. ESTprep: preprocessing cDNA sequence reads. Bioinformatics 19: 1318–1324, 2003.
- Schuler GD. Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J Mol Med 75: 694–698, 1997.
- Singh PK, Jia HP, Wiles K, Hesselberth J, Liu L, Conway BD, Greenberg EP, Valore EV, Welsh MJ, Ganz T, Tack BF, and McCray PB Jr. Production of β-defensins by human airway epithelia. Proc Natl Acad Sci USA 95: 14961–14966, 1998.
- Soares MB and Bonaldo MF. Construction and screening of normalized cDNA libraries. In: Genome Analysis: A Laboratory Manual, edited by Birren B, Green ED, Klapholz S, Myers R, and Roskams J. New York: Cold Spring Harbor Laboratory Press, 1998, p. 49–157.
- Strausberg RL, Feingold EA, Klausner RD, and Collins FS. The mammalian gene collection. Science 286: 455–457, 1999.
- Trapnell BC, Chu CS, Paakko PK, Banks TC, Yoshimura K, Ferrans VJ, Chernick MS, and Crystal RG. Expression of cystic fibrosis transmembrane conductance regulator gene in the respiratory tract of normal individuals and individuals with cystic fibrosis. Proc Natl Acad Sci USA 88: 6565–6569, 1991.
- Trivedi N, Bischof J, Davis S, Pedretti K, Scheetz TE, Braun TA, Roberts CA, Robinson NL, Sheffield VC, Soares MB, and Casavant TL. Parallel creation of non-redundant gene indices from partial mRNA transcript. Fut Generation Comput Syst 18: 863–870, 2002.
- Trivedi N, Pedretti KT, Braun TA, Scheetz TE, and Casavant TL. Alternative parallelization strategies in EST clustering. 7th Int Conf on Parallel Compiler Technologies, 2003.
- Venter JC et al. (Celera Genomics). The sequence of the human genome. Science 291: 1304–1351, 2001.
- Welsh MJ, Ramsey BW, Accurso F, and Cutting GR. Cystic fibrosis. In: The Metabolic and Molecular Basis of Inherited Disease (8 ed.), edited by Scriver CR, Beaudet AL, Sly WS, Valle D, Childs B, and Vogelstein B. New York, NY: McGraw-Hill, 2001, p. 5121–5189.