Information Retrieval: A Health and Biomedical Perspective, Third Edition

William Hersh, M.D.

Chapter 9 Update

This update contains all new references cited in the author's OHSU BMI 514/614 course for Chapter 9.

One natural language processing (NLP) system that has been used for both patient-specific and knowledge-based information is MetaMap, which was developed at the National Library of Medicine. A recent paper gives a historical update and describes recent enhancements (Aronson and Lang, 2010).

There is also a new book (Cohen and Demner-Fushman, 2014) as well as several book chapters (Denny, 2012; Cohen and Hunter, 2013; Chen and Sarkar, 2014; Doan et al., 2014).
Aronson, A. and Lang, F. (2010). An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17: 229-236.
Cohen, KB and Demner-Fushman, D (2014). Biomedical Natural Language Processing. Amsterdam, Netherlands, John Benjamins Publishing.
Denny, JC (2012). Mining Electronic Health Records in the Genomics Era. PLOS Computational Biology: Translational Bioinformatics. M. Kann and F. Lewitter.
Cohen, KB and Hunter, LE (2013). Text mining for translational bioinformatics. PLoS Computational Biology. 9(4): e1003044.
Chen, ES and Sarkar, IN (2014). Mining the Electronic Health Record for Disease Knowledge. In Biomedical Literature Mining. V. Kumar and H. Tipney. New York, NY, Springer: 269-286.
Doan, S, Conway, M, et al. (2014). Natural Language Processing in Biomedicine: A Unified System Architecture Overview. In Clinical Bioinformatics. R. Trent. New York, NY, Springer. 275-294.
A systematic review published in 2010 examined all of the automated coding and classification studies in clinical natural language processing (NLP) (Stanfill et al., 2010 - a paper that started out as an OHSU BMI 510 term paper and culminated in a master's capstone!). The aggregated studies showed a wide variety of clinical areas where NLP was used and an equally wide variety of results, usually measured in terms of recall and precision. One of the major unanswered questions, raised in a paper I wrote in 2005, is how good is "good enough" in clinical NLP? Chapman et al. (2011) also note a need for more and varied evaluation tasks and larger test collections.
Stanfill, MH, Williams, M, et al. (2010). A systematic literature review of automated clinical coding and classification systems. Journal of the American Medical Informatics Association. 17: 646-651.
Hersh, W (2005). Evaluation of biomedical text mining systems: lessons learned from information retrieval. Briefings in Bioinformatics. 6: 344-356.
Chapman, W., Nadkarni, P., et al. (2011). Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the American Medical Informatics Association, 18: 540-543.
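The recall and precision measures used across these studies are computed from counts of correct, spurious, and missed extractions; a minimal sketch with invented counts (the numbers are illustrative, not from any cited study):

```python
def recall_precision(true_positives, false_positives, false_negatives):
    """Standard IR/NLP evaluation measures.

    recall    = TP / (TP + FN): fraction of true mentions the system found
    precision = TP / (TP + FP): fraction of system output that was correct
    """
    recall = true_positives / (true_positives + false_negatives)
    precision = true_positives / (true_positives + false_positives)
    # F1 combines the two as their harmonic mean
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Hypothetical system output: 80 correct extractions, 20 spurious, 40 missed
r, p, f = recall_precision(80, 20, 40)
print(f"recall={r:.2f} precision={p:.2f} F1={f:.2f}")
# → recall=0.67 precision=0.80 F1=0.73
```

The "good enough" question is precisely about what values of these measures a clinical application can tolerate.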
The i2b2 challenge evaluations have continued and drawn a wide variety of research groups. The tasks it has covered in its different years include:
  • Automated de-identification of records (Uzuner et al., 2007)
  • Identification of smoking status from medical discharge summaries (Uzuner et al., 2008)
  • Identification of obesity and its co-morbidities (Uzuner, 2009)
  • Extracting medication information (Uzuner et al., 2010)
  • Relationships between concepts (entities) in clinical text (Uzuner et al., 2011)
  • Co-reference resolution and sentiment classification (Uzuner et al., 2012)
  • Detection of temporal relations (Sun et al., 2013)
  • De-identification and risk factor detection (Uzuner and Stubbs, 2015)
  • Classification of symptom severity for a patient based on an initial psychiatric evaluation (2016 task - no publications yet)

Uzuner, O., Luo, Y., et al. (2007). Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association, 14: 550-563.
Uzuner, O., Goldstein, I., et al. (2008). Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association, 15: 14-24.
Uzuner, O. (2009). Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association, 16: 561-570.
Uzuner, O., Solti, I., et al. (2010). Extracting medication information from clinical text. Journal of the American Medical Informatics Association, 17: 514-518.
Uzuner, Ö., South, B., et al. (2011). 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18: 552-556.
Uzuner, O., Bodnari, A., et al. (2012). Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association: Epub ahead of print.
Sun, W, Rumshisky, A, et al. (2013). Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association. 20: 806-813.
Uzuner, O and Stubbs, A (2015). Practical applications for natural language processing in clinical research: The 2014 i2b2/UTHealth shared tasks. Journal of Biomedical Informatics. 58(Suppl): S1-S5.

One task of the i2b2 challenge has continued to be an important research area, not only for its intrinsic value, but also for its value in enabling clinical NLP research. This is automated de-identification of clinical narratives. A recent review of research on this topic found a variety of methods used on an equally wide variety of document types, making comparison of approaches and systems difficult (Meystre et al., 2010).

Stubbs and Uzuner (2015) describe the corpus for the 2014 i2b2/UTHealth de-identification task.
Meystre, SM, Friedlin, FJ, et al. (2010). Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Medical Research Methodology. 10: 70.
Stubbs, A and Uzuner, O (2015). Annotating longitudinal clinical narratives for de-identification: the 2014 i2b2/UTHealth corpus. Journal of Biomedical Informatics. 58: S20-S29.
Another development is the emergence of large projects focused on use of clinical NLP. One of the most productive of these is the Electronic Medical Records and Genomics (eMERGE) Network, a large-scale consortium that links the growing number of DNA biorepositories with electronic health record (EHR) systems for "large-scale, high-throughput genetic research" (McCarty et al., 2011; Wilke et al., 2011). Some of the findings include use of clinical NLP for:
  • Identifying QT prolongation in ECG reports (Denny et al., 2009)
  • Replicating finding of known gene-disease associations from research data in EHR data for several diseases (Denny et al., 2010; Ritchie et al., 2010)
  • Discovering new gene-disease associations (Denny et al., 2010)
  • Identifying genomic variants associated with atrioventricular conduction abnormalities (Denny et al., 2010), red blood cell traits (Kullo et al., 2010), thyroid disorders (Denny et al., 2011), white blood cell count abnormalities (Crosslin et al, 2012)
  • Identification of patients needing colorectal cancer testing (Denny et al., 2012)
A recent book chapter provides an overview of EHR record-mining in the eMERGE Network (Denny, 2012), while a recent paper outlines a number of "lessons learned" (Newton et al., 2013):
  • Multisite validation improves accuracy of the phenotype algorithm
  • Validation targets must be carefully considered and defined
  • Specification of time frames for variables makes validation easier and improves accuracy
  • Using repeated measures requires defining the relevant time period and specifying the most meaningful value to be studied
  • Patient movement in and out of health systems can result in incomplete or fragmented data
  • The review scope should be defined carefully
  • Care is required in combining EMR and research data
  • Medication data is best assessed using claims, medications dispensed, or medications prescribed
  • Algorithm development and validation work should be an iterative process
  • Validation by content experts or structured chart review is critical for accurate results
McCarty, C., Chisholm, R., et al. (2011). The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Medical Genomics, 4: 13.
Wilke, R., Xu, H., et al. (2011). The emerging role of electronic medical records in pharmacogenomics. Clinical Pharmacology and Therapeutics, 89: 379-386.
Denny, J., Miller, R., et al. (2009). Identifying QT prolongation from ECG impressions using a general-purpose natural language processor. International Journal of Medical Informatics, 78(Suppl 1): S34-42.
Denny, J., Ritchie, M., et al. (2010). Identification of genomic predictors of atrioventricular conduction: using electronic medical records as a tool for genome science. Circulation, 122: 2016-2021.
Ritchie, M., Denny, J., et al. (2010). Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. American Journal of Human Genetics, 86: 560-572.
Denny, J., Ritchie, M., et al. (2010). PheWAS: Demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics, 26: 1205-1210.
Kullo, LJ, Ding, K, et al. (2010). A genome-wide association study of red blood cell traits using the electronic medical record. PLoS ONE. 5(9): e13011.
Denny, JC, Crawford, DC, et al. (2011). Variants near FOXE1 are associated with hypothyroidism and other thyroid conditions: using electronic medical records for genome- and phenome-wide studies. American Journal of Human Genetics. 89: 529-542.
Crosslin, DR, McDavid, A, et al. (2012). Genetic variants associated with the white blood cell count in 13,923 subjects in the eMERGE Network. Human Genetics. 131: 639-652.

Denny, J., Choma, N., et al. (2012). Natural language processing improves identification of colorectal cancer testing in the electronic medical record. Medical Decision Making, 32: 188-197.
Denny, JC (2012). Mining Electronic Health Records in the Genomics Era. PLOS Computational Biology: Translational Bioinformatics. M. Kann and F. Lewitter.
Newton, KM, Peissig, PL, et al. (2013). Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. Journal of the American Medical Informatics Association. 20(e1): e147-154.

An additional large-scale project is one of the four collaborative research centers developed as part of the Strategic Health IT Advanced Research Projects (SHARP) Program of the Office of the National Coordinator for Health IT (ONC). The SHARPn Project aims to "transform EHR data into standards-conforming, comparable information suitable for large-scale analyses, inferencing, and integration of disparate health data" (Chute et al., 2011; Rea et al., 2012). Projects include data normalization; clinical NLP, consisting of extraction from clinical free text based on standards and interoperability as well as transformation of unstructured text into structured data; high-throughput phenotyping; and data quality assessment. The SHARPn NLP work builds on earlier Mayo Clinic NLP work in the clinical Text Analysis and Knowledge Extraction System (cTAKES) (Savova et al., 2010).
Chute, C., Pathak, J., et al. (2011). The SHARPn project on secondary use of Electronic Medical Record data: progress, plans, and possibilities. AMIA Annual Symposium Proceedings 2011, Washington, DC. 248-256.
Rea, S., Pathak, J., et al. (2012). Building a robust, scalable and standards-driven infrastructure for secondary use of EHR data: The SHARPn project. Journal of Biomedical Informatics, 45: 763-771.
Savova, G., Masanz, J., et al. (2010). Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17: 507-513.
The chapter omitted some older research in clinical NLP:
  • Negation detection (Chapman et al., 2001) – developed the NegEx system to detect negation in clinical charts
  • Syndromic surveillance of emergency department chief complaints (Chapman et al., 2005) – achieved low sensitivity but high specificity
  • Description of the system of Denny et al. used in the eMERGE Project (Denny et al., 2005)
  • Completeness of findings in spinal disability exams (Brown et al., 2006)
  • Clinical research – finding patients with congestive heart failure (Pakhomov et al., 2007) and classifying foot examination results in patients with diabetes (Pakhomov et al., 2008)
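The NegEx algorithm of Chapman et al. works by matching a list of negation trigger phrases and marking clinical terms that fall within a fixed token window of a trigger. A simplified sketch of the forward-scope idea, with an illustrative (not the published) trigger list and window size:

```python
import re

# Illustrative subset of NegEx-style negation triggers, not the full published list
NEGATION_TRIGGERS = ["no", "denies", "without", "no evidence of", "negative for"]

def is_negated(sentence, term, window=5):
    """Return True if `term` appears within `window` tokens after a negation trigger."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    term_tokens = term.lower().split()
    # find each position where the term starts
    for i in range(len(tokens) - len(term_tokens) + 1):
        if tokens[i:i + len(term_tokens)] == term_tokens:
            # look back up to `window` tokens for a trigger phrase
            preceding = " ".join(tokens[max(0, i - window):i])
            if any(re.search(r"\b" + re.escape(t) + r"\b", preceding)
                   for t in NEGATION_TRIGGERS):
                return True
    return False

print(is_negated("The patient denies chest pain or dyspnea.", "chest pain"))  # → True
print(is_negated("Chest pain radiating to the left arm.", "chest pain"))      # → False
```

The published algorithm also handles post-term triggers (e.g., "pain: absent") and pseudo-negations; this sketch shows only the core window-matching idea.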

Chapman, W., Bridewell, W., et al. (2001). A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34: 301-310.
Chapman, W., Dowling, J., et al. (2005). Classification of emergency department chief complaints into 7 syndromes: a retrospective analysis of 527,228 patients. Annals of Emergency Medicine, 46: 445-455.
Denny, JC, Spickard, A, et al. (2005). Identifying UMLS concepts from ECG Impressions using KnowledgeMap. AMIA Annual Symposium Proceedings, Washington, DC. 196-200.
Brown, S., Speroff, T., et al. (2006). eQuality: electronic quality assessment from narrative clinical reports. Mayo Clinic Proceedings, 81: 1472-1481.
Pakhomov, S., Weston, S., et al. (2007). Electronic medical records for clinical research: application to the identification of heart failure. American Journal of Managed Care, 13: 281-288.
Pakhomov, S., Hanson, P., et al. (2008). Automatic classification of foot examination findings using statistical natural language processing and machine learning. Journal of the American Medical Informatics Association, 15: 198-202.
There continues to be newer clinical NLP research as well:
  • Extracting cancer disease characteristics from pathology reports (Coden et al., 2009)
  • Identifying section headers in clinical documents (Denny et al., 2009)
  • Determining aspirin use in patients with cardiovascular disease (Pakhomov et al., 2010)
  • Improved information retrieval using concept associations in records (Hristidis et al., 2010)
  • Processing emergency department notes for syndromic surveillance (Gerbier et al., 2011)
  • Identification of post-operative complications in clinical notes (Murff et al., 2011)
  • Identification of influenza for biosurveillance from encounter notes (Elkin et al., 2012)
  • Detection of Charlson comorbidities (Singh et al., 2012)
  • Discovery of postoperative complications (FitzHenry et al., 2013; Tien et al., 2015)
  • Identification of drug and food allergies entered using non-standard terminology (Epstein et al., 2013)
  • Information extraction from narrative pathology reports and automatic population of structured templates (Ou and Patrick, 2014)
  • Novel associations between ICD-9 codes in clinical data and those recognized by MetaMap in MEDLINE records (Hanauer et al., 2014)
  • High-throughput phenotyping (Yu et al., 2015)
  • Case detection of diabetes (Zheng et al., 2016)
  • Identification of high-risk heart failure patients (Evans et al., 2016)
  • Diseases associated with HLA variants (Karnes et al., 2017)
  • Correlation of mammographic and pathologic findings in clinical decision support (Patel et al., 2017)
Coden, A., Savova, G., et al. (2009). Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model. Journal of Biomedical Informatics, 42: 937-949.
Denny, J., Spickard, A., et al. (2009). Evaluation of a method to identify and categorize section headers in clinical documents. Journal of the American Medical Informatics Association, 16: 806-815.
Pakhomov, S., Shah, N., et al. (2010). Automated processing of electronic medical records is a reliable method of determining aspirin use in populations at risk for cardiovascular events. Informatics in Primary Care, 18: 125-133.
Hristidis, V., Varadarajan, R., et al. (2010). Information discovery on electronic health records using authority flow techniques. BMC Medical Informatics & Decision Making, 10: 64.
Gerbier, S., Yarovaya, O., et al. (2011). Evaluation of natural language processing from emergency department computerized medical records for intra-hospital syndromic surveillance. BMC Medical Informatics & Decision Making, 11:50.
Murff, H., FitzHenry, F., et al. (2011). Automated identification of postoperative complications within an electronic medical record using natural language processing. Journal of the American Medical Association, 306: 848-855.
Elkin, P., Froehling, D., et al. (2012). Comparison of natural language processing biosurveillance methods for identifying influenza from encounter notes. Annals of Internal Medicine, 156: 11-18.
Singh, B, Singh, A, et al. (2012). Derivation and validation of automated electronic search strategies to extract Charlson comorbidities from electronic medical records. Mayo Clinic Proceedings. 87: 817-824.
FitzHenry, F, Murff, HJ, et al. (2013). Exploring the frontier of electronic health record surveillance: the case of postoperative complications. Medical Care. 51: 509-516.
Tien, M, Kashyap, R, et al. (2015). Retrospective derivation and validation of an automated electronic search algorithm to identify post operative cardiovascular and thromboembolic complications. Applied Clinical Informatics. 6: 565-576.
Epstein, RH, StJacques, P, et al. (2013). Automated identification of drug and food allergies entered using non-standard terminology. Journal of the American Medical Informatics Association. 20: 962-968.
Ou, Y and Patrick, J (2014). Automatic structured reporting from narrative cancer pathology reports. e-Journal of Health Informatics. 8(2): e20.
Hanauer, DA, Saeed, M, et al. (2014). Applying MetaMap to Medline for identifying novel associations in a large clinical dataset: a feasibility analysis. Journal of the American Medical Informatics Association. 21: 925-937.
Yu, S, Liao, KP, et al. (2015). Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. Journal of the American Medical Informatics Association. 22: 993-1000.
Zheng, L, Wang, Y, et al. (2016). Web-based real-time case finding for the population health management of patients with diabetes mellitus: a prospective validation of the natural language processing–based algorithm with statewide electronic medical records. JMIR Medical Informatics. 4(4): e37.
Evans, RS, Benuzillo, J, et al. (2016). Automated identification and predictive tools to help identify high-risk heart failure patients: pilot evaluation. Journal of the American Medical Informatics Association. 23: 872-878.
Karnes, JH, Bastarache, L, et al. (2017). Phenome-wide scanning identifies multiple diseases and disease severity phenotypes associated with HLA variants. Science Translational Medicine. 9: eaai8708.
Patel, TA, Puppala, M, et al. (2017). Correlating mammographic and pathologic findings in clinical decision support using natural language processing and data mining methods. Cancer. 123: 114-121.
There have also been systematic reviews of NLP in medical fields:
  • Oncology (Spasić et al., 2014; Yim et al., 2016)
  • Pathology (Burger et al., 2016)
  • Radiology (Pons et al., 2016)
Spasić, I, Livsey, J, et al. (2014). Text mining of cancer-related information: review of current status and future directions. International Journal of Medical Informatics. 83: 605-623.
Yim, WW, Yetisgen, M, et al. (2016). Natural language processing in oncology: a review. JAMA Oncology. 2: 797-804.
Burger, G, Abu-Hanna, A, et al. (2016). Natural language processing in pathology: a scoping review. Journal of Clinical Pathology. 69: 949-955.
Pons, E, Braun, LMM, et al. (2016). Natural language processing in radiology: a systematic review. Radiology. 279: 329-343.
One of the challenges for clinical notes comes from the "tension" that clinicians face between documentation systems that are structured, which enable easier processing of the data and text, versus allowing flexibility of what is entered (Rosenbloom et al., 2011). Another challenge is the lack of large-scale shared data (Chapman et al., 2011; Friedman, 2013), although larger annotated corpora are being developed (Albright et al., 2013).
Rosenbloom, S., Denny, J., et al. (2011). Data from clinical notes: a perspective on the tension between structure and flexible documentation. Journal of the American Medical Informatics Association, 18: 181-186.
Chapman, W., Nadkarni, P., et al. (2011). Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the American Medical Informatics Association, 18: 540-543.
Albright, D, Lanfranchi, A, et al. (2013). Towards comprehensive syntactic and semantic annotations of the clinical narrative. Journal of the American Medical Informatics Association, 20: 922-930.
Friedman, C, Rindflesch, TC, et al. (2013). Natural language processing: state of the art and prospects for significant progress, a workshop sponsored by the National Library of Medicine. Journal of Biomedical Informatics. 46: 765-773.
An overview of opinions on "the way forward" was published in 2008 by a number of "leading scientists," including this author (Altman et al., 2008). Another overview of challenges was published by Dai et al. (2010). Some newer overviews include a new book (Jurafsky and Martin, 2008), a review focused on genomics and systems biology (Harmston et al., 2010), a review on biomedical research and integrative biology (Rebholz-Schuhmann et al., 2012), and another focused on neuroscience (Ambert and Cohen, 2012). Neves (2014) described the various corpora available for biomedical text mining research, finding them more developed for genes, proteins, and chemicals than for diseases, genomic variations, and mutations.

Another overview summarizes the major challenge evaluations in biomedical text mining (Huang and Lu, 2016).

One concern of many scientists is publishers' reluctance to provide access to their content for text-mining activities (Jha, 2012).

One new resource that has been developed to aid biomedical text mining is the BioLexicon, a terminological resource with part-of-speech tagging, synonyms, and other linguistic information (Thompson et al., 2011). Another tool used MeSH indexing to build a resource aiding in word sense disambiguation (Jimeno-Yepes et al., 2011). Another new corpus, the Colorado Richly Annotated Full-Text (CRAFT) corpus, provides great semantic diversity (Bada et al., 2012) and has uncovered many differences in commonly used NLP tools (Verspoor et al., 2012).

A variety of other work has pushed the field forward. One recent study found that nouns are not the only parts of speech that vary in biomedical language; verbs vary as well, through both nominalization and alternation (i.e., changes in surface form with the same underlying meaning) (Cohen et al., 2008). Noting a disconnect between the biomedical literature and the data published in gene sequence databases, Baran et al. (2011) developed a tool called pubmed2ensembl that integrated gene sequences in the Ensembl resource with literature describing those sequences in PubMed and PubMed Central. Haeussler et al. (2011) describe a similar system that identifies gene sequences within articles and maps them to records in GenBank.

Altman, R., Bergman, C., et al. (2008). Text mining for biology - the way forward: opinions from leading scientists. Genome Biology, 9(Suppl 2): S7.
Dai, H., Chang, Y., et al. (2010). New challenges for biological text-mining in the next decade. Journal of Computer Science and Technology, 25: 169-179.
Jurafsky, D and Martin, JH (2008). Speech and Language Processing (2nd Edition). Upper Saddle River, NJ, Pearson Prentice Hall.
Harmston, N, Filsell, W, et al. (2010). What the papers say: text mining for genomics and systems biology. Human Genomics. 5: 17-29.
Rebholz-Schuhmann, D, Oellrich, A, et al. (2012). Text-mining solutions for biomedical research: enabling integrative biology. Nature Reviews Genetics. 13: 829-839.
Ambert, KH and Cohen, AM (2012). Text-mining and neuroscience. International Review of Neurobiology. 103: 109-132.
Neves, M (2014). An analysis on the entity annotations in biological corpora. F1000Research. 3: 96.
Huang, CC and Lu, Z (2016). Community challenges in biomedical text mining over 10 years: success, failure and the future. Briefings in Bioinformatics. 17: 132-144.
Jha, A. (2012). Text mining: what do publishers have against this hi-tech research tool? The Guardian.
Thompson, P., McNaught, J., et al. (2011). The BioLexicon: a large-scale terminological resource for biomedical text mining. BMC Bioinformatics, 12: 397.
Bada, M, Eckert, M, et al. (2012). Concept annotation in the CRAFT corpus. BMC Bioinformatics. 13: 161.
Verspoor, K, Cohen, KB, et al. (2012). A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics. 13: 207.
Cohen, K., Palmer, M., et al. (2008). Nominalization and alternations in biomedical language. PLoS ONE, 3(9): e3.
Baran, J., Gerner, M., et al. (2011). pubmed2ensembl: a resource for mining the biological literature on genes. PLoS ONE, 6(9): e24716.
Haeussler, M., Gerner, M., et al. (2011). Annotating genes and genomes with DNA sequences extracted from biomedical articles. Bioinformatics, 27: 980-986.
The BioCreative initiative has continued. BioCreative focuses mostly on developing tools for curation of literature, but some of its tasks cover text mining. Some recent text-mining tasks include:

Biocreative IV (2013)
  • Chemical and Drug Named Entity Recognition (CHEMDNER) - Detection of mentions of chemical compounds and drugs, in particular those chemical entity mentions that can subsequently be linked to a chemical structure (Krallinger, Leitner, et al., 2015; Krallinger, Rabal, et al., 2015)
  • Gene Ontology (GO) curation - Development of automatic methods to aid GO curators in identifying articles with curatable GO information (triage) and extracting gene function terms and the associated evidence sentences in full-length articles (Mao et al., 2014)
Biocreative V (2015)
  • CHEMDNER patents - Identification of chemical compounds and of relevant biological context in patents (Krallinger et al., 2015)
  • Chemical-disease relation (CDR) task - Automatic detection of chemical/drugs and diseases, and their relations in PubMed abstracts. In particular, the CDR task focuses on extracting the relationship of drug-induced diseases (Wei et al., 2016)
Biocreative VI (2017)
  • Mining protein interactions and mutations for precision medicine - identifying and extracting protein-protein interactions affected by mutations described in the biomedical literature
  • Text mining chemical-protein interactions
Leitner, F., Mardis, S., et al. (2010). An overview of BioCreative II.5. IEEE Transactions on Computational Biology and Bioinformatics, 7: 385-399.
Arighi, C., Lu, Z., et al. (2011). Overview of the BioCreative III Workshop. BMC Bioinformatics, 12(Suppl 8): S1.
Mao, Y, VanAuken, K, et al. (2014). Overview of the gene ontology task at BioCreative IV. Database. 2014: bau086.
Krallinger, M, Leitner, F, et al. (2015). CHEMDNER: The drugs and chemical names extraction challenge. Journal of Cheminformatics. 7(Suppl 1): S1.
Krallinger, M, Rabal, O, et al. (2015). The CHEMDNER corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics. 7(Suppl 1): S2.
Krallinger, M, Rabal, O, et al. (2017). Information retrieval and text mining technologies for chemistry. Chemical Reviews: Epub ahead of print.
Wei, CH, Peng, Y, et al. (2016). Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database. 2016: baw032.
Another challenge evaluation in biological information extraction has been the BioNLP Shared Task (BioNLP-ST). There have been four BioNLP-ST events focused on event extraction from scientific literature, mostly using the Genia corpus:
  • 2009 - The main task was core event extraction, where participants were required to identify events concerning given proteins. Two additional optional events included event enrichment and negation and speculation recognition.
  • 2011 - The main task continued to be event extraction.
  • 2013 - The main task consisted of six types of event extraction: Genia Event Extraction for NFkB knowledge base construction, Cancer Genetics, Pathway Curation, Corpus Annotation with Gene Regulation Ontology, Gene Regulation Network in Bacteria, and Bacteria Biotopes.
  • 2016 - The main task consisted of three types of event extraction: Genetic and molecular mechanisms involved in plant seed development; Bacteria Biotope, bacteria locations and normalization with an ontology; and NFkB Knowledge base construction.

There continued to be many interesting applications of text categorization both in and out of health and biomedicine. Certainly a major biomedical application is selection of content categories. Ruau et al. (2011) showed that automated annotations of molecular datasets provided much more comprehensive (i.e., higher recall) assignments of MeSH terms than manual annotation. Automated annotation tools have also been shown to help humans doing manual annotation (Shatkay et al., 2008).
Ruau, D., Mbagwu, M., et al. (2011). Comparison of automated and human assignment of MeSH terms on publicly-available molecular datasets. Journal of Biomedical Informatics, 44(Suppl 1): S39-S43.
Shatkay, H., Pan, F., et al. (2008). Multi-dimensional classification of biomedical text: toward automated, practical provision of high-utility text to diverse users. Bioinformatics, 24: 2086-2093.
Other work shows applicability of text categorization beyond categorizing content. Categorization approaches have also been investigated as a means to identify "claims" in scientific papers (Blake, 2010). An important finding of this study was that most claims were made in the body of the paper and not the abstract, indicating that systems showing only the abstract (i.e., PubMed) do not retrieve all the claims made in papers.
Blake, C. (2010). Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles. Journal of Biomedical Informatics, 43: 173-189.
Text categorization has also been used to signal articles likely to need inclusion in updates of systematic reviews. Since publication of the book, Cohen et al. (2009) have continued to refine work in this area and most recently have shown the ability to identify about 70% of all new publications warranting inclusion in systematic drug category reviews while maintaining an overall low alert rate (Cohen et al., 2012). Likewise, Dalal et al. (2013) have shown similar success with two other drug review categories.
Cohen, AM, Ambert, K, et al. (2009). Cross-topic learning for work prioritization in systematic review creation and update. Journal of the American Medical Informatics Association. 16: 690-704.
Cohen, AM, Smalheiser, NR, et al. (2015). Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine. Journal of the American Medical Informatics Association. 22: 707-717.

Dalal, SR, Shekelle, PG, et al. (2013). A pilot study using machine learning and domain knowledge to facilitate comparative effectiveness review updating. Medical Decision Making. 33: 343-355.
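This triage work frames update screening as supervised text categorization over article text. A toy bag-of-words Naive Bayes sketch of that framing (the miniature training set and vocabulary are invented for illustration, not drawn from the cited studies):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes text classifier.

    docs: list of (token_list, label) pairs; returns the model parameters.
    """
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, len(docs)

def classify(model, tokens):
    """Return the most probable label (add-one smoothed log probabilities)."""
    class_counts, word_counts, vocab, n_docs = model
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n_docs)  # class prior
        total = sum(word_counts[label].values())
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented miniature training set: abstracts labeled include/exclude for an update
train = [
    ("randomized trial statin efficacy".split(), "include"),
    ("double blind placebo controlled trial".split(), "include"),
    ("case report of rare adverse event".split(), "exclude"),
    ("narrative review of guidelines".split(), "exclude"),
]
model = train_nb(train)
print(classify(model, "randomized placebo controlled trial".split()))  # → include
```

The published systems use far richer features and tuned classifiers; the point here is only the framing of update screening as a two-class text categorization problem.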
Other work focuses on queries to search engines, with the major application being in syndromic surveillance. A more recent development since the applications described in the book is a system developed by Google called Flu Trends (Carneiro and Mylonakis, 2009). Early work found the system to perform well retroactively in predicting H1N1 influenza in the US (Cook et al., 2011) as well as rates of influenza and patient utilization in emergency departments (Dugas et al., 2012).

However, the system performed less well in the US 2012-2013 flu season (Butler, 2013). This led Lazer et al. (2014) to issue a warning against "big data hubris." Subsequent research has found that other approaches to flu prediction can outperform Google's search-query-based approach. Additional data used includes flu surveillance data itself (Martin et al., 2014), selection of specific queries (Santillana et al., 2014), and EHR data (Yang et al., 2017).
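At its core, nowcasting flu activity from query volumes amounts to regressing an official surveillance signal on a search signal. A minimal ordinary-least-squares sketch follows, with entirely made-up numbers; Google's actual model fit log-odds of aggregated query fractions and is not reproduced here.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ≈ a*x + b with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical weekly data: share of flu-related search queries vs. an
# official influenza-like-illness (ILI) rate -- synthetic values only.
query_share = [0.8, 1.1, 1.9, 2.6, 3.4]
ili_rate = [1.5, 1.9, 3.2, 4.1, 5.3]
slope, intercept = fit_linear(query_share, ili_rate)
# Nowcast the current week's ILI rate from the latest query volume,
# which is available days or weeks before official surveillance reports.
nowcast = slope * 3.9 + intercept
```

The appeal of the approach, and the source of the "big data hubris" critique, is visible even in this toy: the fit is only as good as the stability of the relationship between query behavior and actual illness.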
Carneiro, HA and Mylonakis, E (2009). Google Trends: a web-based tool for real-time surveillance of disease outbreaks. Clinical Infectious Diseases. 49: 1557-1564.
Cook, S, Conrad, C, et al. (2011). Assessing Google flu trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS ONE. 6(8): e23610.
Dugas, A., Hsieh, Y., et al. (2012). Google Flu Trends: correlation with emergency department influenza rates and crowding metrics. Clinical Infectious Diseases, 54: 463-469.
Butler, D (2013). When Google got flu wrong. Nature. 494: 155-156.
Lazer, D, Kennedy, R, et al. (2014). Big data. The parable of Google Flu: traps in big data analysis. Science. 343: 1203-1205.

Martin, LJ, Xu, B, et al. (2014). Improving Google Flu Trends Estimates for the United States through Transformation. PLoS ONE. 10(4): e0122939.
Santillana, M, Zhang, DW, et al. (2014). What can digital disease detection learn from (an external revision to) Google Flu Trends? American Journal of Preventive Medicine. 47: 341-347.
Yang, S, Santillana, M, et al. (2017). Using electronic health records and Internet search information for accurate influenza forecasting. BMC Infectious Disease. 17: 332.
Outside of medicine, a number of other categorization techniques have been shown to be beneficial. Going beyond detection of spam in email, Cormack et al. (2011) have also shown the ability to identify "spam" in Web search engine output. Other research has focused on social media. For example, analysis of Twitter feeds has been found retrospectively to predict stock market trends (Bollen et al., 2010). Likewise, Facebook "Likes" have been shown to predict many personal attributes, from gender and geographic location to sexual orientation and political views (Kosinski et al., 2013). This analysis also found an association between scores on intelligence tests and "liking" of the television show, The Colbert Report (something I Facebook-Liked before this study came out!).

Another study of Facebook users who also sought medical care found that more prolific Facebook posters tended to post about health conditions more frequently (Smith et al., 2017). There was also an association in this sample between Facebook posting and a diagnosis of depression. Attributes of Twitter users have also been found to correlate with health conditions. Eichstaedt et al. (2015) found that language patterns reflecting negative social relationships, disengagement, and negative emotions were risk factors for atherosclerotic heart disease, while positive emotions and psychological engagement were found to be protective factors, even after controlling for income and education. Hawkins et al. (2015) found that many patients tweeted about their experiences in hospitals, but few strong associations with process or outcome measures were discovered.

A more controversial application of text categorization is the automated grading of student essays on standardized tests. Shermis and Hamner (2012) claimed that automated methods were found to perform highly accurately against a gold standard of papers graded by humans. Perelman (2013), however, took strong exception to those claims, and it is likely that further research will be needed to sort out the specific role for automated high-stakes grading of writing.
Cormack, G., Smucker, M., et al. (2011). Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14: 441-465.
Bollen, J., Mao, H., et al. (2010). Twitter mood predicts the stock market. Journal of Computational Science, 2: 1-8.
Kosinski, M, Stillwell, D, et al. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences. 110: 5802-5805.
Smith, RJ, Crutchley, P, et al. (2017). Variations in Facebook posting patterns across validated patient health conditions: a prospective cohort study. Journal of Medical Internet Research. 19(1): e7.
Eichstaedt, JC, Schwartz, HA, et al. (2015). Psychological language on Twitter predicts county-level heart disease mortality. Psychological Science. 26: 159-169.
Hawkins, JB, Brownstein, JS, et al. (2015). Measuring patient-perceived quality of care in US hospitals using Twitter. BMJ Quality & Safety. 25: 404-413.
Shermis, MD and Hamner, B (2012). Contrasting State-of-the-Art Automated Scoring of Essays: Analysis. National Council on Measurement in Education, Vancouver, BC
Perelman, LC (2013). Critique (Ver. 3.4) of Mark D. Shermis & Ben Hammer, “Contrasting State-of-the-Art Automated Scoring of Essays: Analysis”. Cambridge, MA, Massachusetts Institute of Technology.

Of all the topics in this chapter, clearly question-answering has received the most attention from the media, mostly revolving around the IBM Watson system. Part of that attention stems from its potential role in medicine. Watson was actually developed out of IBM's participation in the TREC Question-Answering Track (Voorhees, 2005). A technical overview of the system has been described by Ferrucci et al. (2010). A great deal of further detail has been provided in an entire issue of the IBM Journal of Research and Development (Ferrucci, 2012). As most know, Watson beat humans at the Jeopardy! television game show (Markoff, 2011) and is now being applied to healthcare (Lohr, 2012).

Watson is built around a system called DeepQA, which uses massively parallel computing to acquire knowledge from resources of a given domain (Ferrucci et al., 2010). Its learning process is built around sample questions from the domain. One key step is to identify lexical answer types (LATs) in the domain. Among general questions, some common LATs include he, country, city, man, film, state, she, author, group, here, and company. From these LATs, natural language processing (NLP) is applied to text, and knowledge representation and reasoning (KRR) is applied to structured knowledge. Machine learning is then applied to questions and their answers.

When questions are entered into Watson at run-time, the system performs question classification, aiming to detect LATs and the focus of the question. The process is aided by detection of relationships stated in the question as well as decomposition of questions into subparts. Watson then generates hypotheses for answers, performing a step called "soft filtering" to prune the list of possibilities, and then sorting the remainder to rank the answers and assign confidence to each. The parallel nature of its algorithms makes it highly scalable.
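The run-time flow just described, namely generating hypotheses, pruning them with a cheap soft filter, and ranking the survivors with more expensive evidence scoring, can be caricatured in a few lines. The two scoring functions below are placeholders for what in Watson are large ensembles of NLP and KRR scorers, so this is a sketch of the control flow only, not of the actual system.

```python
def answer(question, candidates, cheap_score, evidence_score,
           threshold=0.2, top_k=3):
    """Toy DeepQA-style flow. Hypothesis generation is assumed to have
    already produced `candidates` (in Watson this comes from search).
    A cheap 'soft filter' score prunes the list before the costlier
    evidence scorer ranks survivors and attaches a confidence value."""
    survivors = [c for c in candidates if cheap_score(question, c) >= threshold]
    ranked = sorted(((evidence_score(question, c), c) for c in survivors),
                    reverse=True)
    return ranked[:top_k]  # (confidence, answer) pairs, best first

def token_overlap(question, candidate):
    """One plausible (hypothetical) soft-filter feature: fraction of
    candidate tokens that also appear in the question."""
    q = set(question.lower().split())
    toks = candidate.lower().split()
    return sum(t in q for t in toks) / len(toks)
```

The point of the two-stage design is economic: the soft filter keeps the expensive evidence scorers from running on every hypothesis, which is also what makes the massively parallel version scale.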

Since winning at Jeopardy!, Watson has "graduated medical school" (Cerrato, 2012). To apply Watson to any new domain, including medicine, three areas of adaptation are required (Ferrucci et al., 2012):
  • Content adaptation - acquiring and modeling new content
  • Training adaptation - adding and learning from new question types
  • Functional adaptation - adapting question analysis, hypothesis scoring, and new functionality specific to the domain
In Watson's first foray into the medical domain, it was trained using several resources from internal medicine (discussed in earlier chapters), such as ACP Medicine, PIER, Merck Manual, and MKSAP. The content adaptation process required not only named entity detection (e.g., disambiguation of terms and their senses), but also measure recognition and interpretation (e.g., age or blood test value) as well as recognition of unary relations (e.g., elevated <test result>). Watson was trained with 5000 questions from Doctor's Dilemma, a competition somewhat like Jeopardy! that is run by the American College of Physicians and in which medical trainees participate each year. A sample question is, Familial adenomatous polyposis is caused by mutations of this gene, with the answer being, APC Gene. (Googling the text of this and the two other sample questions provided also gives the correct answer at the top of the ranking!)

Watson was evaluated on an additional 188 unseen questions. The primary outcome measure was recall at 10 answers, and the results varied from 0.49 for the core system to 0.77 for the fully adapted and trained system (Ferrucci et al., 2012). It would have been interesting to see Watson compared against other systems, such as Google or PubMed, as well as assessed using other measures, such as mean reciprocal rank (MRR). A future use case for Watson is to apply the system to data in EHR systems, ultimately aiming to serve as a clinical decision support system (Cerrato, 2012).
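For readers unfamiliar with these measures, both recall at 10 and mean reciprocal rank are simple functions of where the correct answers appear in a ranked list. A minimal sketch:

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant (correct) answers found in the top k
    of the ranked list -- the primary measure in the Watson evaluation."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(runs):
    """runs: one (ranked answer list, set of correct answers) pair per
    question. MRR averages 1/rank of the first correct answer, so it
    rewards systems that put a correct answer near the top."""
    total = 0.0
    for ranked, correct in runs:
        rr = 0.0
        for i, ans in enumerate(ranked, start=1):
            if ans in correct:
                rr = 1.0 / i
                break
        total += rr
    return total / len(runs)
```

MRR would have been informative here precisely because recall at 10 cannot distinguish a system that ranks the correct answer first from one that ranks it tenth.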

Since the original published study, very little other peer-reviewed research has been published (Kim, 2015), although Watson appears regularly in IBM television commercials and other marketing materials. A number of researchers, including a long-time artificial intelligence researcher, have been critical of the claims made for it (Schank, 2016).
Voorhees, E. (2005). Question Answering in TREC, 233-257, in Voorhees, E. and Harman, D., eds. TREC - Experiment and Evaluation in Information Retrieval. Cambridge, MA. MIT Press.
Ferrucci, D., Brown, E., et al. (2010). Building Watson: an overview of the DeepQA Project. AI Magazine, 31(3): 59-79.
Ferrucci, DA (2012). Introduction to "This is Watson". IBM Journal of Research and Development. 56(3/4): 1-15.

Markoff, J. (2011). Computer Wins on ‘Jeopardy!’: Trivial, It’s Not. New York Times. February 16, 2011.
Lohr, S. (2012). The Future of High-Tech Health Care - and the Challenge. New York Times. February 13, 2012.
Cerrato, P (2012). IBM Watson Finally Graduates Medical School. Information Week, October 23, 2012.
Ferrucci, D, Levas, A, et al. (2012). Watson: Beyond Jeopardy! Artificial Intelligence. 199-200: 93-105.
Kim, C (2015). How much has IBM’s Watson improved? Abstracts at 2015 ASCO. Health + Digital.
Schank, R (2016). The fraudulent claims made by IBM about Watson and AI. They are not doing "cognitive computing" no matter how many times they say they are. Roger Schank.
Some other question-answering systems for biomedicine have been developed, one that attempts to parse and map questions into facts determined from journal articles (Neves and Leser, 2015) and another that aims to find sentences that likely have the answer (Hristovski et al., 2015).

Question-answering has been part of the BioASQ initiative (Tsatsaronis et al., 2015).
Neves, M and Leser, U (2015). Question answering for Biology. Methods. 74: 36-46.
Hristovski, D, Dinevski, D, et al. (2015). Biomedical question answering using semantic relations. BMC Bioinformatics. 16: 6.
Tsatsaronis, G, Balikas, G, et al. (2015). An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics. 16: 138.
There continues to be work in text summarization in the biomedical domain, as exemplified by recent systematic reviews focused on its use with biomedical literature (Mishra et al., 2014) as well as EHRs (Pivovarov and Elhadad, 2015).

In individual systems, Workman et al. (2010) used a semantic retrieval system to develop a summarization tool for a consumer genetics reference. Rebholz-Schuhmann and colleagues (2010) developed PaperMaker, an application that provides summarization to help authors write manuscripts by suggesting controlled vocabulary terms, more consistent language, and correct and appropriate references. Plaza (2014) has assessed the value of different terminology systems for assisting summarization, finding that individual systems, as opposed to the UMLS Metathesaurus, provide the most value. Del Fiol et al. (2016) and Slager et al. (2017) have evaluated a system designed to summarize reports of evidence for physicians, finding that summarized reports leveraging the PICO format were preferred over the original formats of the literature.

Text summarization work (along with question-answering and knowledge base population) has also continued in the Text Analysis Conference (TAC) sponsored by NIST. The 2017 cycle includes a task on information extraction of adverse drug reactions (ADRs).
Mishra, R, Bian, J, et al. (2014). Text summarization in the biomedical domain: a systematic review of recent research. Journal of Biomedical Informatics. 52: 457-467.
Pivovarov, R and Elhadad, N (2015). Automated methods for the summarization of electronic health records. Journal of the American Medical Informatics Association. 22: 938–947.
Workman, T., Fiszman, M., et al. (2010). Biomedical text summarization to support genetic database curation: using Semantic MEDLINE to create a secondary database of genetic information. Journal of the Medical Library Association, 98: 273-281.
Rebholz-Schuhmann, D., Kavaliauskas, S., et al. (2010). PaperMaker: validation of biomedical scientific publications. Bioinformatics, 26: 982-984.
Plaza, L (2014). Comparing different knowledge sources for the automatic summarization of biomedical literature. Journal of Biomedical Informatics. 52: 319-328.
Del Fiol, G, Mostafa, J, et al. (2016). Formative evaluation of a patient-specific clinical knowledge summarization tool. International Journal of Medical Informatics. 86: 126-134.
Slager, SL, Weir, CR, et al. (2017). Physicians' perception of alternative displays of clinical research evidence for clinical decision support - a study with case vignettes. Journal of Biomedical Informatics: Epub ahead of print.

Last updated May 21, 2017