Information Retrieval: A Health and Biomedical Perspective, Third Edition

William Hersh, M.D.

Chapter 8 Update

This update contains all new references cited in the author's OHSU BMI 514/614 course for Chapter 8.

The nature of IR research continues to change with the growing ubiquity of search tools such as PubMed and Google. It is difficult if not impossible to stand up a research system anything like them, and most newer research builds on top of them or other large-scale systems.

Another major change in IR research is the move beyond system-oriented evaluation of ad hoc retrieval systems, i.e., the traditional approach of seeing how well a batch approach to submitting queries against a test collection can perform. Some have argued that the ad hoc search problem is for the most part "solved," even though there are probably incremental improvements that can be made.

A number of recent general publications are noteworthy. Sanderson and Croft (2012) published a broad history of IR research from literally the beginning, i.e., from use of mechanical devices to modern-day advances. More specific to biomedical and health informatics but not limited to IR is the Statement on the Reporting of Evaluation Studies in Health Informatics (STARE-HI), which provides recommendations on the reporting of evaluative research in informatics (Talmon et al., 2009). A more theoretical and mathematical framework on evaluative research in IR has also been published by Carterette (2011).

Some new books summarize the state of the art. A book covering both text retrieval (search) and text mining (extraction) has been published by Zhai and Massung (2016). White (2016) has also published a book on the dynamic nature of Web search.

Several dozen computer science-oriented IR researchers came together in 2012 to inventory challenges and opportunities for the field (Allan, 2012). Six cross-cutting themes emerged across the many topics deemed to be critical:
  • Retrieval beyond ranked lists, such as enhanced methods for querying, interaction, and answering
  • Better support of users, including those who are inexperienced, illiterate, or disabled
  • Understanding and leveraging the context of the user's search
  • Moving beyond documents to retrieve more complex data and complicated results
  • Exploring domains, such as private data, richly connected workplace data, and collections of "apps"
  • Evaluation, especially in the context of the above new challenges

TREC recently celebrated its 25th anniversary. The entire event was captured on video, including a talk on the various Biomedical Tracks by this author (starting at about 50 minutes into the Part 3 video). Another way to see how IR evaluation has evolved is to look at the tracks from recent TREC conferences, which show incorporation of interactive searching as well as new tasks and content types.

Allan, J., Croft, B., et al. (2012). Frontiers, challenges and opportunities for information retrieval: report from SWIRL 2012. SIGIR Forum, 46(1): 2-32.
Carterette, B. (2011). System effectiveness, user models, and user utility: a conceptual framework for investigation. The 34th Annual ACM SIGIR Conference, Beijing, China. ACM Press. 903-912.
Sanderson, M. and Croft, W. (2012). The History of Information Retrieval Research. Proceedings of the IEEE, 100: 1444-1451.
Talmon, J., Ammenwerth, E., et al. (2009). STARE-HI--Statement on reporting of evaluation studies in health informatics. International Journal of Medical Informatics, 78: 1-9.
Zhai, C and Massung, S (2016). Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. New York, NY, Association for Computing Machinery.
White, RW (2016). Interactions with Search Systems. Cambridge, England, Cambridge University Press.
With the emergence of large-scale commercial IR systems as well as new technologies, such as the Web and mobile devices, the nature of system-oriented IR research has changed from a focus on basic search systems and tasks (e.g., ad hoc retrieval) to systems for modern search tasks (e.g., Web retrieval, incremental search, contextual search, etc.).
A good deal of the research focus on pure lexical-statistical systems in recent years has been on statistical language models (Zhai, 2009) and machine learning methods for optimal ranking of system output (Li, 2011). For the latter, Microsoft Research has described (Qin et al., 2010) and made available for research a data collection called LETOR (Learning to Rank).

The language model approach is mathematically based on calculating the probability of a word being in a query given that it occurs in a relevant document (Zhai and Massung, 2016). A "smoothing" function provides weighting for previously unseen words in queries.
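As a rough illustration of the language model idea, the sketch below scores each document by the log probability that its unigram model would generate the query. It uses Jelinek-Mercer smoothing, which is only one common choice; the source does not say which smoothing function is meant. The toy documents, the `lam` mixing weight, and the function name are assumptions made for this example.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """Log P(query | doc) under a unigram model with Jelinek-Mercer smoothing.

    P(w | doc) = (1 - lam) * tf(w, doc) / |doc| + lam * cf(w) / |collection|
    """
    doc_counts = Counter(doc)
    score = 0.0
    for word in query:
        p_doc = doc_counts[word] / len(doc) if doc else 0.0
        p_coll = collection_counts[word] / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The word occurs nowhere in the collection, so smoothing cannot help.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection (hypothetical data for illustration only).
docs = {
    "d1": "aspirin reduces risk of heart attack".split(),
    "d2": "statin therapy lowers cholesterol".split(),
    "d3": "aspirin and statin therapy after heart attack".split(),
}
collection_counts = Counter(w for words in docs.values() for w in words)
collection_len = sum(collection_counts.values())

query = "statin heart".split()
scores = {
    name: query_log_likelihood(query, words, collection_counts, collection_len)
    for name, words in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because of the smoothing term, documents d1 and d2 still receive a finite score even though each is missing one query word, while d3, which contains both words, ranks first.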

Learning to rank approaches apply various machine learning methods to word-level and document-level (e.g., link-based) features (Zhai and Massung, 2016). Despite all the successes of "deep learning" approaches in areas outside IR, minimal headway has been seen in improving results of document retrieval (Cohen, 2016). Recently, however, Dehghani et al. (2017) used the large AOL query set and the rankings generated for those queries by BM25 as “weakly supervised” training data for a neural ranking model, which was then evaluated with the Robust and ClueWeb09 test collections. The resulting system was shown to achieve better results than straight BM25.
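The weak supervision idea can be sketched in a few lines: score documents with BM25 and treat the resulting order as noisy preference labels for training a learned ranker. This is not the authors' implementation; the toy documents, the parameter defaults (`k1=1.2`, `b=0.75`), the non-negative IDF variant, and the function names are all assumptions for illustration, and the neural model that would consume the pairs is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (name -> token list) against `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs.values()) / n_docs
    doc_freq = Counter(w for d in docs.values() for w in set(d))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for word in query:
            if tf[word] == 0:
                continue
            # Non-negative IDF variant (the "+ 1" inside the logarithm).
            idf = math.log((n_docs - doc_freq[word] + 0.5) / (doc_freq[word] + 0.5) + 1)
            norm = tf[word] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[word] * (k1 + 1) / norm
        scores[name] = score
    return scores


def weak_pairs(scores):
    """Turn BM25 scores into (preferred, other) pairs usable as noisy labels."""
    pairs = []
    for first, second in combinations(scores, 2):
        if scores[first] > scores[second]:
            pairs.append((first, second))
        elif scores[second] > scores[first]:
            pairs.append((second, first))
    return pairs


# Toy collection (hypothetical data for illustration only).
docs = {
    "d1": "aspirin reduces risk of heart attack".split(),
    "d2": "statin therapy lowers cholesterol".split(),
    "d3": "aspirin and statin therapy after heart attack".split(),
}
scores = bm25_scores("aspirin heart".split(), docs)
pairs = weak_pairs(scores)
```

Each pair says only that BM25 preferred one document over another for this query. With a large query log, many such pairs can stand in for human relevance judgments when training a ranker.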

Can we still learn from failure analysis? A major analysis done in 2003 but not widely published until 2009 looked at a variety of high-performing systems from TREC to determine why they failed on various topics (Buckley, 2009). The results showed that most systems failed on a given topic for mostly the same reasons. The most common reason was failure to recognize or promote various aspects of a topic. Rarely was the relationship between terms important; instead, the aspects of topic and document terms were critical.

A number of well-known older research systems are no longer maintained:
  • SMART (Salton’s Magical Retriever of Text), maintained by Chris Buckley through last decade
  • Lemur – evolution to Indri
Although some are:
  • Indri - outgrowth of Lemur Toolkit
  • Terrier - focused on research and experimentation
  • Zettair - built to handle large document collections
By the same token, a number of early open-source search engines have not been maintained:
  • WAIS – developed by Thinking Machines to demonstrate its high-end parallel-computing system, and it outlasted the hardware!
  • SWISH-E – long-time Web search engine
But some are widely used, including by IR researchers:
  • Lucene - originated as a general search engine for Web searching, enhanced with more features in Solr (Kuc, 2013) and ElasticSearch (Gormley and Tong, 2015)
  • Xapian - supports Boolean and probabilistic search
  • Sphinx - built on open-source business model
Zhai, CX (2009). Statistical Language Models for Information Retrieval, Morgan & Claypool Publishers.
Li, H (2011). Learning to Rank for Information Retrieval and Natural Language Processing, Morgan & Claypool Publishers.
Qin, T., Liu, T., et al. (2010). LETOR: a benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13: 346-374.
Cohen, D, Ai, Q, et al. (2016). Adaptability of neural networks on varying granularity IR tasks. Neu-IR'16 SIGIR Workshop on Neural Information Retrieval, Pisa, Italy
Dehghani, M, Zamani, H, et al. (2017). Neural ranking models with weak supervision. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR2017), Tokyo, Japan
Buckley, C. (2009). Why current IR engines fail. Information Retrieval, 12: 652-665.
Kuc, R (2013). Apache Solr 4 Cookbook. Birmingham, England, Packt Publishing.
Gormley, C and Tong, Z (2015). Elasticsearch: The Definitive Guide. Sebastopol, CA, O'Reilly & Associates.
The NLM's Lexical Systems Group maintains a site that supports the SPECIALIST Lexicon and natural language processing (NLP) tools for it. A historical overview of the MetaMap system that maps text to Unified Medical Language System (UMLS) Metathesaurus controlled terms for a variety of IR (and other) applications has been published (Aronson and Lang, 2010).

The introductory text by Jackson and Moulinier (2007) has been updated. A more recent overview of text processing for Web applications has been published by Ingersoll et al. (2013). An overview of NLP for health and biomedical applications was published by Friedman and Elhadad (2014). The book referenced above by Zhai and Massung (2016) also provides an overview of NLP and its use in search.

A sample parser is available from the Stanford Natural Language Processing Group.
Aronson, A. and Lang, F. (2010). An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17: 229-236.
Jackson, P and Moulinier, I (2007). Natural Language Processing for Online Applications: Text retrieval, Extraction and Categorization, Second Revised Edition. Amsterdam, Holland, John Benjamins Publishing Company.
Ingersoll, GS, Morton, TS, et al. (2013). Taming Text - How to Find, Organize, and Manipulate It. Shelter Island, NY, Manning Publications.
Friedman, C and Elhadad, N (2014). Natural Language Processing in Health Care and Biomedicine. Biomedical Informatics: Computer Applications in Health Care and Biomedicine (Fourth Edition). E. Shortliffe and J. Cimino. London, England, Springer: 255-284.
New applications of IR attract research interest, especially with development of new technologies (e.g., mobile devices) and new types of information (e.g., social media).
The CLEF initiative has been renamed from the Cross-Language Evaluation Forum to the Conference and Labs of the Evaluation Forum, although it maintains its focus on multilingual IR. It also has a new URL. The ImageCLEF track continues to thrive, including its medical image search task, which is described in a separate section. An update on the state of the art in cross-language IR was recently provided by Nie (2010).
Nie, JY (2010). Cross-Language Information Retrieval, Morgan & Claypool Publishers.
The TREC Web Track and a number of derived TREC tracks have focused on various aspects of Web searching. These tracks have benefited from the creation of new large test collections built by extensive Web crawling. After a hiatus for a number of years, the TREC Web Track reemerged in 2009 with the development of the ClueWeb09 collection. This collection consists of 1.04 billion Web pages in ten different languages. Its size is 5 TB compressed and 25 TB uncompressed. The collection contains about 4.8 billion unique URLs (some mapping to the same page) and 7.9 billion total outlinks.

The TREC 2011 (Clarke et al., 2011) and 2012 Web Tracks (Clarke et al., 2012) featured two tasks. One was a standard ad hoc retrieval task. Relevance judging was carried out on the following scale (Clarke et al., 2011):
  1. Nav - page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site (relevance grade 4)
  2. Key - page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine (relevance grade 3)
  3. HRel - content of page provides substantial information on the topic (relevance grade 2)
  4. Rel - content of page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page (relevance grade 1)
  5. Non - content of page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query (relevance grade 0)
  6. Junk - page does not appear to be useful for any reasonable purpose; it may be spam or junk (relevance grade -2)
The primary retrieval effectiveness measure used for this task was expected reciprocal rank (ERR) (Chapelle, 2009), although normalized discounted cumulative gain (NDCG) and mean average precision (MAP) were measured as well.
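To make the graded measures concrete, the following sketch computes ERR and NDCG for a ranked list of relevance grades on the scale above. ERR follows the cascade formulation of Chapelle et al. (2009), in which a document of grade g satisfies the user with probability (2^g - 1) / 2^g_max. The NDCG function uses the exponential gain with a log2 discount, which is one common formulation. Treating the Junk grade (-2) as zero gain is an assumption of this sketch, as are the function names and the example run.

```python
import math


def err(grades, max_grade=4):
    """Expected reciprocal rank for a ranked list of relevance grades."""
    p_continue = 1.0  # probability the user reaches the current rank
    total = 0.0
    for rank, grade in enumerate(grades, start=1):
        g = max(grade, 0)  # assumption: Junk (-2) is treated like Non (0)
        p_stop = (2 ** g - 1) / (2 ** max_grade)
        total += p_continue * p_stop / rank
        p_continue *= 1 - p_stop
    return total


def dcg(grades):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum(
        (2 ** max(g, 0) - 1) / math.log2(rank + 1)
        for rank, g in enumerate(grades, start=1)
    )


def ndcg(grades, judged_grades=None):
    """NDCG; the ideal ranking is built from `judged_grades` (default: `grades`)."""
    pool = grades if judged_grades is None else judged_grades
    ideal = dcg(sorted(pool, reverse=True)[: len(grades)])
    return dcg(grades) / ideal if ideal > 0 else 0.0


# Hypothetical run: Nav page first, then Non, Rel, and Junk.
run = [4, 0, 1, -2]
run_err = err(run)
run_ndcg = ndcg(run)
```

ERR rewards placing a highly relevant page early far more than NDCG does: once the Nav page at rank 1 has most likely satisfied the user, later documents contribute almost nothing to the score.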

The second task of the Web Track has been a diversity task, where the retrieval goal has been for the search output to contain complete coverage of the topic without excessive redundancy. Each topic is divided into subtopics, with each document judged for relevance to each subtopic. A variant of ERR, intent-aware ERR (ERR-IA), is the primary performance measure, although NDCG is used as well.

There have also been some derivations of the Web track using the same collection. TREC 2011 featured the Session Track, which focused on retrieval over multiple sessions. A set of sessions were provided that contained the current query along with the set of past queries; the ranked list of URLs for each past query; and the set of clicked URLs and snippets, along with time spent reading, for each clicked URL (Kanoulas et al., 2011). TREC 2012 featured the Contextual Suggestion Track, where the task was to retrieve documents relative to a user's context (location, date, time of year, etc.) and interests (Dean-Hall et al., 2012).
Clarke, CLA, Craswell, N, et al. (2011). Overview of the TREC 2011 Web Track. The Twentieth Text REtrieval Conference (TREC 2011) Proceedings, Gaithersburg, MD. National Institute of Standards and Technology
Clarke, CLA, Craswell, N, et al. (2012). Overview of the TREC 2012 Web Track. The Twenty-First Text REtrieval Conference (TREC 2012) Proceedings, Gaithersburg, MD. National Institute of Standards and Technology
Chapelle, O, Metzler, D, et al. (2009). Expected reciprocal rank for graded relevance. 18th ACM Conference on Information and Knowledge Management, Hong Kong, China. 621-630.
Kanoulas, E, Hall, M, et al. (2011). Overview of the TREC 2011 Session Track. The Twentieth Text REtrieval Conference (TREC 2011) Proceedings, Gaithersburg, MD. National Institute of Standards and Technology
Dean-Hall, A, Clarke, CLA, et al. (2012). Overview of the TREC 2012 Contextual Suggestion Track. The Twenty-First Text REtrieval Conference (TREC 2012) Proceedings, Gaithersburg, MD. National Institute of Standards and Technology
A new version of the Web test collection was created to supersede ClueWeb09. ClueWeb12 contains about 870 million English-language Web pages. The crawl captured all textual pages (including XML, JavaScript, and CSS pages) and images, but ignored multimedia (e.g., Flash) and compressed (e.g., Zip) files. It also included all URLs in Twitter feeds. It has been used in a number of TREC tracks.

Another large new collection that has been developed is the Knowledge Base Acceleration (KBA) Stream Corpus, which contains a variety of Web-based resources that are meant to be used in KBA tasks that aim to accelerate the discovery of knowledge.
Some new TREC tracks in recent years demonstrate applications of IR systems and include:
  • Knowledge Base Acceleration - filter a stream of documents to accelerate users' filling in knowledge gaps
  • Real Time Summary - of social media (Twitter) content
  • Temporal Summarization - monitor events over time, using KBA Stream Corpus
  • Tasks - infer underlying tasks of users, using ClueWeb12
  • Dynamic Domain - simulate dynamic, exploratory search within complex information domains, with the goal of using feedback to improve search but also knowing when to stop, using KBA Stream Corpus
  • Total Recall - aim for complete recall on appropriate tasks (e.g., legal), with some data highly private
    • Some relevance to medical tasks: systematic reviews and search over medical records
Another type of retrieval that has become commercially important is recommender systems, which can be viewed in some ways as an outgrowth of filtering. These are important for many e-commerce Web sites, such as Amazon and Netflix. There are two types of filtering:
  • Content-based filtering - find items the user likes and get more
  • Collaborative filtering - find similar users and show items those users like
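The two types of filtering can be contrasted in a small sketch. Content-based filtering compares item descriptions, whereas collaborative filtering compares users' rating histories. The item features, ratings, user names, and function names below are hypothetical, and production recommender systems use far more elaborate models than this nearest-neighbor illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def content_based(liked_item, item_features):
    """Rank other items by feature similarity to an item the user liked."""
    target = item_features[liked_item]
    others = [i for i in item_features if i != liked_item]
    return sorted(others, key=lambda i: cosine(target, item_features[i]), reverse=True)


def collaborative(user, ratings):
    """Recommend unseen items rated by the most similar other user."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    unseen = {i: r for i, r in ratings[nearest].items() if i not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)


# Hypothetical catalog and rating data.
item_features = {
    "film_a": {"drama": 1, "medical": 1},
    "film_b": {"drama": 1, "medical": 1, "history": 1},
    "film_c": {"comedy": 1},
}
ratings = {
    "ann": {"film_a": 5, "film_b": 4},
    "bob": {"film_a": 5, "film_b": 5, "film_d": 4},
    "cat": {"film_c": 5, "film_a": 1},
}

similar_items = content_based("film_a", item_features)
suggestions = collaborative("ann", ratings)
```

Note that the collaborative approach can suggest film_d even though nothing is known about its content, which is the main practical difference between the two approaches.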
Several new biomedical challenge evaluations have emerged since publication of the third edition of the book. One is the TREC Medical Records Track that was introduced in Chapter 1 (Voorhees and Tong, 2011; Voorhees and Hersh, 2012; Voorhees, 2013). The use case for the TREC Medical Records Track was identifying patients from a collection of medical records who might be candidates for clinical studies. This is a real-world task for which automated retrieval systems could greatly aid the ability to carry out clinical research, quality measurement and improvement, or other "secondary uses" of clinical data (Safran et al., 2007). The metric used to evaluate systems was inferred normalized discounted cumulative gain (infNDCG), which takes into account some other factors, such as incomplete judgment of all documents retrieved by all research groups.

The data for the track was a corpus of de-identified medical records developed by the University of Pittsburgh Medical Center. Records containing data, text, and ICD-9 codes are grouped by "visits" or patient encounters with the health system. (Due to the de-identification process, it was impossible to know whether one or more visits might emanate from the same patient.) There were 93,551 documents mapped into 17,264 visits.

In the 2012 track, the best manual results were reported by Demner-Fushman et al. (2012; infNDCG = 0.680) and Bedrick et al. (2012; infNDCG = 0.526), while the best automated results were reported by Zhu and Carterette (2012; infNDCG = 0.578) and Qi and Laquerre (2012; infNDCG = 0.547).

A number of research groups used a variety of techniques, such as synonym and query expansion, machine learning algorithms, and matching against ICD-9 codes, but still had results that were not better than manually constructed queries employed by groups from NLM or OHSU (although the NLM system had a number of advanced features, such as document field searching). Although the performance of systems in the track was "good" from an IR standpoint, the results also showed that identification of patient cohorts would be a challenging task even for automated systems. Some of the automated features that had variable success included document section focusing, term expansion, and term normalization (mapping into controlled terms).

Follow-on studies reported benefit from various approaches to query expansion:
  • Addition of ICD-9 codes to queries improved results (Amini et al., 2016)
  • Adding terms from other resources improved performance (Zhu et al., 2014)
  • Adding additional non-synonym terms from the UMLS Metathesaurus improved results (Martinez et al., 2014)
A failure analysis over the data from the 2011 track demonstrated why there are still many challenges that need to be overcome (Edinger, 2012). This analysis found a number of reasons why visits frequently retrieved were not relevant:
  • Notes contain very similar term confused with topic
  • Topic symptom/condition/procedure done in the past
  • Most, but not all, criteria present
  • All criteria present but not in the time/sequence specified by the topic description
  • Topic terms mentioned as future possibility
  • Topic terms not present--can't determine why record was captured
  • Irrelevant reference in record to topic terms
  • Topic terms denied or ruled out
The analysis also found reasons why visits rarely retrieved were actually relevant:
  • Topic terms present in record but overlooked in search
  • Visit notes used a synonym for topic terms
  • Topic terms not named and must be derived
  • Topic terms present in diagnosis list but not visit notes
Voorhees, EM and Tong, RM (2011). Overview of the TREC 2011 Medical Records Track. The Twentieth Text REtrieval Conference Proceedings (TREC 2011), Gaithersburg, MD. National Institute of Standards and Technology.
Voorhees, E and Hersh, W (2012). Overview of the TREC 2012 Medical Records Track. The Twenty-First Text REtrieval Conference Proceedings (TREC 2012), Gaithersburg, MD. National Institute of Standards and Technology
Voorhees, EM (2013). The TREC Medical Records Track. Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, Washington, DC. 239-246.
Safran, C, Bloomrosen, M, et al. (2007). Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. Journal of the American Medical Informatics Association. 14: 1-9.
Demner-Fushman, D, Abhyankar, S, et al. (2012). NLM at TREC 2012 Medical Records Track. The Twenty-First Text REtrieval Conference Proceedings (TREC 2012), Gaithersburg, MD. National Institute of Standards and Technology
Bedrick, S, Edinger, T, et al. (2012). Identifying patients for clinical studies from electronic health records: TREC 2012 Medical Records Track at OHSU. The Twenty-First Text REtrieval Conference Proceedings (TREC 2012), Gaithersburg, MD. National Institute of Standards and Technology
Zhu, D and Carterette, B (2012). Exploring evidence aggregation methods and external expansion sources for medical record search. The Twenty-First Text REtrieval Conference Proceedings (TREC 2012), Gaithersburg, MD. National Institute of Standards and Technology
Qi, Y and Laquerre, PF (2012). Retrieving medical records with "sennamed": NEC Labs America at TREC 2012 Medical Record Track. The Twenty-First Text REtrieval Conference Proceedings (TREC 2012), Gaithersburg, MD. National Institute of Standards and Technology
Amini, I, Martinez, D, et al. (2016). Improving patient record search: a meta-data based approach. Information Processing & Management. 52: 258-272.
Zhu, D, Wu, ST, et al. (2014). Using large clinical corpora for query expansion in text-based cohort identification. Journal of Biomedical Informatics. 49: 275-281.
Martinez, D, Otegi, A, et al. (2014). Improving search over electronic health records using UMLS-based query expansion through random walks. Journal of Biomedical Informatics. 51: 100-106.
Edinger, T, Cohen, AM, et al. (2012). Barriers to retrieving patient information from electronic health record data: failure analysis from the TREC Medical Records Track. AMIA 2012 Annual Symposium, Chicago, IL. 180-188.
There are some unique challenges for IR research with medical records. One of these, not limited to IR (e.g., also including NLP, machine learning, etc.), is the privacy of the patients whose records are being searched (Friedman, 2013). Given the growing concern over privacy and confidentiality, how can informatics (including IR) researchers carry out this work while assuring no information about patients is revealed? It turns out that this problem is not limited to patient records. There are many private collections of information over which we might like to search, such as email or corporate repositories. The TREC Total Recall Track addressed this issue and developed an architecture that involved sending systems to the data (Roegiest, 2016). Hanbury et al. (2015) expanded on this notion for medical IR, dubbing it Evaluation as a Service (EaaS). One limitation of this approach is that researchers run their systems on data held securely elsewhere, so they never see the data and receive only the results of their runs.
Friedman, C, Rindflesch, TC, et al. (2013). Natural language processing: state of the art and prospects for significant progress, a workshop sponsored by the National Library of Medicine. Journal of Biomedical Informatics. 46: 765-773.
Roegiest, A and Cormack, GV (2016). An architecture for privacy-preserving and replicable high-recall retrieval experiments. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, Pisa, Italy. 1085-1088.
Hanbury, A, Müller, H, et al. (2015). Evaluation-as-a-Service: Overview and Outlook, arXiv.
The most recent biomedical track in TREC is the Clinical Decision Support (CDS) Track (Roberts, Simpson, et al., 2016; Roberts, Demner-Fushman, et al., 2016). In this track, the topic is a clinical case and the task is to retrieve full-text journal articles that provide relevant information on diagnosis, tests, or treatments. The task is ad hoc searching and the collection has been a snapshot of 733K (2014-2015) and 1.25M (2016) full-text articles from PubMed Central. This TREC track is morphing into the Precision Medicine Track starting in 2017 that will use MEDLINE abstracts and documents from
Roberts, K, Simpson, M, et al. (2016). State-of-the-art in biomedical literature retrieval for clinical cases: a survey of the TREC 2014 CDS track. Information Retrieval Journal. 19: 113-148.
Roberts, K, Demner-Fushman, D, et al. (2016). Overview of the TREC 2016 Clinical Decision Support Track. The Twenty-Fifth Text REtrieval Conference (TREC 2016) Proceedings, Gaithersburg, MD
Another biomedical challenge evaluation growing out of CLEF is the CLEF eHealth Evaluation Lab. This challenge evaluation has focused on three tasks:
  1. Information extraction, e.g., named entity recognition and normalization of disorders
  2. Information management, e.g., eHealth data visualization, medical reports management
  3. Information retrieval by patients
Of the three CLEF eHealth Evaluation Lab tasks, only the third (Patient-Centered Information Retrieval) has been focused on IR. In the 2013-2014 tasks, the document collection consisted of a Web crawl that includes about one million documents. From 2015 on, a subset of the ClueWeb12 collection, called B13 and containing about 50 million pages, has been used. The queries and qrels from these tasks have also been made available.
Research in biomedical text searching has not been limited to challenge evaluations. Sometimes users wish to find documents that report negative results, i.e., that use negation. Agarwal et al. (2010) have developed an algorithm, available on a Web site, that aims to single out negated sentences.
Agarwal, S., Yu, H., et al. (2010). BioNØT: a searchable database of biomedical negated sentences. BMC Bioinformatics, 12: 420.
Another area of focus has been on tools to assist those performing systematic reviews, which require the retrieval and analysis of evidence (typically randomized controlled trials, RCTs). A search engine focused on this task has been developed by Smalheiser et al. (2014) and is called Metta. Cohen et al. (2015) have developed machine learning approaches that improve the tagging and retrieval of RCTs. Others have also used machine learning to recognize, from the stream of new literature, high-quality evidence in clinical studies (Kilicoglu, 2009) as well as articles about modalities of molecular medicine (Wehbe, 2009). Paynter et al. (2016) recently looked at evaluation studies of text mining approaches to assist systematic reviews.
Smalheiser, NR, Lin, C, et al. (2014). Design and implementation of Metta, a metasearch engine for biomedical literature retrieval intended for systematic reviewers. Health Information Science and Systems. 2014(2): 1.
Cohen, AM, Smalheiser, NR, et al. (2015). Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine. Journal of the American Medical Informatics Association. 22: 707-717.
Kilicoglu, H., Demner-Fushman, D., et al. (2009). Towards automatic recognition of scientifically rigorous clinical research evidence. Journal of the American Medical Informatics Association, 16: 25-31.
Wehbe, F., Brown, S., et al. (2009). A novel information retrieval model for high-throughput molecular medicine modalities. Cancer Informatics, 8: 1-17.
Paynter, R, Bañez, LL, et al. (2016). EPC Methods: An Exploration of the Use of Text-Mining Software in Systematic Reviews. Rockville, MD, Agency for Healthcare Research and Quality.
Some research has focused on improving the ranking of documents output by the IR system. Essie is a concept-based system developed at the NLM whose main feature is to expand user queries by mapping them into concepts in the UMLS Metathesaurus (Ide et al., 2007). Evaluation with the TREC 2006 Genomics Track test collection showed results equal to the best-performing systems from the challenge evaluation. Another NLM research group compared a variety of document ranking strategies for TREC Genomics Track data, finding that TF*IDF ranking outperformed sentence-level co-occurrence of words and other approaches (Lu et al., 2009). Additional work with the TREC Genomics Track data found that language model approaches increased MAP by 174% over standard TF*IDF (MAP = 0.381) (Abdou and Savoy, 2008). Including MeSH terms in the documents was found to increase MAP by 8.4%, while query expansion also gave a small additional benefit. The TREC Genomics Track archive is now at:
Ide, N., Loane, R., et al. (2007). Essie: a concept-based search engine for structured biomedical text. Journal of the American Medical Informatics Association, 14: 253-263.
Lu, Z., Kim, W., et al. (2009). Evaluating relevance ranking strategies for MEDLINE retrieval. Journal of the American Medical Informatics Association, 16: 32-36.
Abdou, S and Savoy, J (2008). Searching in MEDLINE: query expansion and manual indexing evaluation. Information Processing & Management. 44: 781-799.
Other recent research has focused on methods for improving the assignment of Medical Subject Headings (MeSH) terms by automated means. Trieschnigg et al. (2009) introduced an approach called MeSH Up that uses a machine-learning classification approach called k-nearest neighbor (KNN). Using MEDLINE records from the TREC Genomics Track, the authors show this technique performs better than MetaMap, the NLM's Medical Text Indexer (MTI), and other concept-oriented approaches. Other approaches to this task have focused on identifying similar articles to leverage their MeSH terms. Aljaber et al. (2011) devised and evaluated an approach that gathers MeSH terms from articles that the paper being indexed cites and ranks them based on whether and where they occur in the paper being indexed. Experiments showed improved performance over MTI and MeSH Up. Huang et al. (2011) used a "learning to rank" approach applied to similar documents. Another approach demonstrated an improvement by inserting a graph-based ranking approach of MeSH terms into the MTI process (Herskovic, 2011).
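The k-nearest neighbor idea behind such indexing approaches can be sketched as follows: find the already-indexed documents most similar to the new text and let them vote, weighted by similarity, on which MeSH terms to suggest. This is a minimal illustration and not the MeSH Up implementation; the bag-of-words cosine similarity, the toy indexed documents, the parameters `k` and `top_n`, and the function names are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_mesh(text, indexed, k=2, top_n=2):
    """Suggest MeSH terms by similarity-weighted voting among the k nearest documents."""
    vec = Counter(text.lower().split())
    neighbors = sorted(
        ((cosine(vec, Counter(doc_text.lower().split())), terms)
         for doc_text, terms in indexed),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for similarity, terms in neighbors:
        for term in terms:
            votes[term] += similarity
    return [term for term, _ in votes.most_common(top_n)]


# Toy set of previously indexed documents (hypothetical data).
indexed = [
    ("aspirin therapy for myocardial infarction", ["Aspirin", "Myocardial Infarction"]),
    ("aspirin dose and bleeding risk", ["Aspirin", "Hemorrhage"]),
    ("gene expression in yeast", ["Gene Expression", "Saccharomyces cerevisiae"]),
]

suggested = knn_mesh("low dose aspirin after myocardial infarction", indexed)
```

Here "Aspirin" receives votes from both neighbors and so outranks terms that only one neighbor carries.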

The European Commission and NLM have organized a challenge evaluation to assess semantic indexing of literature called BioASQ (Tsatsaronis et al., 2015). An additional task in BioASQ focuses on question-answering and is covered in Chapter 9. One analysis from this effort from the NLM found that the ten-year-old Medical Text Indexer (MTI) of NLM was still useful and could perhaps be augmented with machine learning approaches (Mork et al., 2017). The study also found that assisted indexing using MTI tended to perform better with precision tasks than recall tasks. An updated overview of the MTI has been published by Mork et al. (2013).
Trieschnigg, D., Pezik, P., et al. (2009). MeSH Up: effective MeSH text classification for improved document retrieval. Bioinformatics, 25: 1412-1418.
Aljaber, B., Martinez, D., et al. (2011). Improving MeSH classification of biomedical articles using citation contexts. Journal of Biomedical Informatics, 44: 881-896.
Huang, M., Neveol, A., et al. (2011). Recommending MeSH terms for annotating biomedical articles. Journal of the American Medical Informatics Association, 18: 660-667.
Herskovic, J., Cohen, T., et al. (2011). MEDRank: using graph-based concept ranking to index biomedical texts. International Journal of Medical Informatics, 80: 431-441.
Tsatsaronis, G, Balikas, G, et al. (2015). An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics. 16: 138.
Mork, J, Aronson, A, et al. (2017). 12 years on - is the NLM medical text indexer still useful and relevant? Journal of Biomedical Semantics. 2017(8): 8.
Mork, J, JimenoYepes, A, et al. (2013). The NLM medical text indexer system for indexing biomedical literature. BioASQ Workshop, Valencia, Spain
Another important area of biomedical IR that has spawned a new challenge evaluation concerns data set retrieval. This work is based on the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), a database of metadata about data sets available online.

The bioCADDIE 2016 Dataset Retrieval Challenge used a snapshot of bioCADDIE and 30 queries to form a challenge evaluation. The best results came from term-based query expansion, usually employing aspects of MeSH. A number of groups used machine learning approaches, although these may have been limited by the small amount of training data that had been made available.
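Term-based query expansion of the kind described above can be sketched in a few lines. This is an illustration under stated assumptions, not any participant's system: the synonym table is a tiny hand-made stand-in for a controlled vocabulary such as MeSH, and the names TOY_SYNONYMS and expand_query are hypothetical.

```python
import re

# Hand-made stand-in for vocabulary lookups (illustrative, not real MeSH data).
TOY_SYNONYMS = {
    "cancer": ["neoplasms", "tumor"],
    "heart attack": ["myocardial infarction"],
}

def expand_query(query, synonyms=TOY_SYNONYMS):
    """Return the original query followed by expansion terms for matched phrases."""
    q = query.lower()
    terms = [q]
    for phrase, alternatives in synonyms.items():
        # Whole-word match so that "cancer" does not fire inside "cancerous".
        if re.search(r"\b" + re.escape(phrase) + r"\b", q):
            for alt in alternatives:
                if alt not in terms:
                    terms.append(alt)
    return terms

expanded = expand_query("gene expression data for breast cancer")
# expanded == ["gene expression data for breast cancer", "neoplasms", "tumor"]
```

The expanded terms would then typically be combined with the original query, for example as additional weighted or OR'ed terms, before searching the data set metadata.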
The notion of the TREC Contextual Suggestion Track has been applied to medicine, with a mobile app making suggestions in the context of a user's health, such as healthy activities and eating (Wing and Yang, 2014).
Wing, C and Yang, H (2014). FitYou: integrating health profiles to real-time contextual suggestion. Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), Gold Coast, Australia. 1263-1264.
Another line of work has focused on analysis of user search logs to understand aspects of health-related searching. Early work focused on processing search logs, mostly from Microsoft Bing, to understand users' characteristics and intentions (Cartright et al., 2011; White and Horvitz, 2012). This approach has uncovered "cyberchondria," defined as unnecessary escalation of health-related concern when searching (White and Horvitz, 2009), and "Web-scale pharmacovigilance," the uncovering of drug interactions from search logs (White et al., 2013; Nguyen et al., 2017). This approach has also been used to identify patients who have a higher likelihood of developing pancreatic carcinoma (Paparrizos et al., 2016) and lung carcinoma (White and Horvitz, 2017). Some limitations of these approaches are the retrospective nature of the data and the inferring of user actions and intent solely from the search logs.
Cartright, MA, White, RW, et al. (2011). Intentions and attention in exploratory health search. Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), Beijing, China. 65-74.
White, RW and Horvitz, E (2012). Studies of the onset and persistence of medical concerns in search logs. Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), Portland, OR. 265-274.
White, RW and Horvitz, E (2009). Cyberchondria: studies of the escalation of medical concerns in Web search. ACM Transactions on Information Systems. 4: 23-37.
White, RW, Tatonetti, NP, et al. (2013). Web-scale pharmacovigilance: listening to signals from the crowd. Journal of the American Medical Informatics Association. 20: 404-408.
Nguyen, T, Larsen, ME, et al. (2017). Estimation of the prevalence of adverse drug reactions from social media. Journal of Biomedical Informatics. 102: 130-137.
Paparrizos, J, White, RW, et al. (2016). Screening for pancreatic adenocarcinoma using signals from web search logs: feasibility study and results. Journal of Oncology Practice. 12: 737-744.
White, RW and Horvitz, E (2017). Evaluation of the feasibility of screening patients for early signs of lung carcinoma in Web search logs. JAMA Oncology. 3: 398-401.
Work in medical image retrieval continues to attract interest. An overview paper was published by Hersh et al. (2009) describing the consolidation of the ImageCLEFmed test collections from 2005-2007. A book describing all of ImageCLEF, not just the medical task, has also been published (Müller et al., 2010). More recently, a ten-year overview of lessons learned from the ImageCLEFmed tasks was published (Kalpathy-Cramer et al., 2015). These lessons included:
  • Text retrieval works better overall than visual retrieval
  • Visual retrieval can be effective for highly precise tasks, such as modality detection and others with small numbers of classes
  • Visual retrieval can also give high early precision, whereas text retrieval sometimes does not
  • Fusion of text and visual approaches is only occasionally helpful and must be done with care
  • Mapping free text queries to controlled terminologies can be helpful
The Web site for ImageCLEF, including the medical tracks, is

Since 2009, ImageCLEFmed has added two tasks beyond the basic ad hoc retrieval task. One of these is modality classification, where systems attempt to identify the image modality (e.g., radiologic image, computerized tomography, publication figures, photographs, etc.). In the early years of this task, a small number (6-8) of modality categories were used. In 2012, however, a larger classification was developed that included over 20 items (Müller et al., 2012). With the smaller, earlier classification set, mixed text and visual retrieval methods worked best. With the newer and larger classification, however, visual retrieval methods have worked best, with text-based approaches alone performing poorly (Müller et al., 2012).

A second additional task has been case-based retrieval. In this task, given a case description with patient demographics, limited symptoms, and test results including imaging studies (but not the final diagnosis), systems must retrieve cases that include images that best suit the provided case description (Kalpathy-Cramer et al., 2010). As with ad hoc retrieval, the best results have come from textual and not visual queries (Kalpathy-Cramer et al., 2011; Müller et al., 2012).

Other image retrieval work outside ImageCLEF has focused on retrieval of journal images based on captions (Yu et al., 2009; Kahn and Rubin, 2009) and on annotation with the goal of improving retrieval (Demner-Fushman et al., 2009).
Hersh, W., Müller, H., et al. (2009). The ImageCLEFmed medical image retrieval task test collection. Journal of Digital Imaging, 22: 648-655.
Müller, H., Clough, P., et al., eds. (2010). ImageCLEF: Experimental Evaluation in Visual Information Retrieval. Heidelberg, Germany. Springer.
Kalpathy-Cramer, J, Seco de Herrera, AG, et al. (2015). Evaluating performance of biomedical image retrieval systems - an overview of the medical image retrieval task at ImageCLEF 2004–2013. Computerized Medical Imaging and Graphics. 39: 55-61.
Müller, H, Seco De Herrera, AG, et al. (2012). Overview of the ImageCLEF 2012 medical image retrieval and classification tasks. CLEF 2012 Working Notes, Rome, Italy
Müller, H, Kalpathy-Cramer, J, et al. (2012). Creating a classification of image types in the medical literature for visual categorization. Medical Imaging 2012: Advanced PACS-based Imaging Informatics and Therapeutic Applications, San Diego, CA. SPIE
Kalpathy-Cramer, J., Bedrick, S., et al. (2010). Retrieving similar cases from the medical literature - the ImageCLEF experience. MEDINFO 2010, Cape Town, South Africa. 1189-1193.
Kalpathy-Cramer, J, Müller, H, et al. (2011). Overview of the CLEF 2011 medical image classification and retrieval tasks. CLEF 2011 Labs and Workshops Notebook Papers, Amsterdam, Netherlands
Yu, H., Agarwal, S., et al. (2009). Are figure legends sufficient? Evaluating the contribution of associated text to biomedical figure comprehension. Journal of Biomedical Discovery and Collaboration, 4: 1.
Kahn, C. and Rubin, D. (2009). Automated semantic indexing of figure captions to improve radiology image retrieval. Journal of the American Medical Informatics Association, 16: 380-386.
Demner-Fushman, D., Antani, S., et al. (2009). Annotation and retrieval of clinically relevant images. International Journal of Medical Informatics, 78: e59-e67.
In 2007, veteran Microsoft IR researcher Susan Dumais famously said, "If in 10 years we are still using a rectangular box and a list of results, I should be fired" (Markoff, 2007). She provided a vision for "thinking outside the (search) box" in 2009 (Dumais, 2009). Another user-oriented researcher offered a vision of what "natural" search interfaces might look like in the future, with the user speaking rather than typing, viewing video rather than reading text, and interacting socially rather than alone (Hearst, 2011).

Since the book was published, other books on the state of the art and research in search user interfaces have been published (Hearst, 2009; Wilson, 2012). In addition, the volume by Shneiderman and colleagues (2009) is now in its fifth edition. Nielsen's famous all-time list of problems has a new URL: The page with links to his prolific writings is now at
Markoff, J (2007). Searching for Michael Jordan? Microsoft Wants a Better Way. New York, NY. New York Times. March 7, 2007.
Dumais, S. (2009). Thinking outside the (search) box. User Modeling, Adaptation, and Personalization, 17th International Conference, UMAP 2009 Proceedings, Trento, Italy. 2.
Hearst, M. (2009). Search User Interfaces. Cambridge, England, Cambridge University Press.
Hearst, M. (2011). 'Natural' search user interfaces. Communications of the ACM, 54(11): 60-67.
Shneiderman, B, Plaisant, C, et al. (2009). Designing the User Interface: Strategies for Effective Human-Computer Interaction, 5th Edition. Reading, MA, Addison-Wesley.
Wilson, ML (2012). Search User Interface Design, Morgan & Claypool Publishers.
As noted above, NLM has enhanced the Medical Text Indexer (MTI) (Mork et al., 2013). More recent additions include a machine learning module for selection of check tags and a "first-line" status for journals where the automated process is the only process used. NLM recently reassessed the system after 12 years of use, finding that it still provided value as a tool to aid MeSH term selection by indexers (Mork et al., 2017). The inter-indexer consistency of MTI is comparable to that of human indexers.
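Inter-indexer consistency is commonly quantified with Hooper's measure: the number of terms two indexers agree on divided by the number of distinct terms either assigned. The sketch below illustrates that measure only. It is not necessarily the measure used by Mork et al., and the function name and example term sets are assumptions for illustration.

```python
def hooper_consistency(terms_a, terms_b):
    """Hooper's consistency: shared terms / all distinct terms from both indexers."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        return 0.0  # undefined when neither indexer assigns a term; 0.0 by convention here
    return len(a & b) / len(union)

# Hypothetical term assignments for one article.
human_terms = ["Aspirin", "Myocardial Infarction", "Humans"]
system_terms = ["Aspirin", "Humans", "Stroke"]
score = hooper_consistency(human_terms, system_terms)  # 2 shared / 4 distinct = 0.5
```

Averaging this score over many articles, once for human-human pairs and once for system-human pairs, gives one way to compare an automated indexer against human indexers.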
Mork, J, JimenoYepes, A, et al. (2013). The NLM medical text indexer system for indexing biomedical literature. BioASQ Workshop, Valencia, Spain
Mork, J, Aronson, A, et al. (2017). 12 years on - is the NLM medical text indexer still useful and relevant? Journal of Biomedical Semantics. 2017(8): 8.
A number of systems over the years have attempted to cluster results for users, few of which have been evaluated with real users. Mu et al. (2011) went further and evaluated such a system with users, finding fewer clicks needed to navigate to relevant information with the clustering system. User satisfaction was also rated higher for the clustering system.
Mu, X., Ryu, H., et al. (2011). Supporting effective health and biomedical information retrieval and navigation: a novel facet view interface evaluation. Journal of Biomedical Informatics, 44: 576-586.
Both CLEF and TREC have incorporated interactive retrieval evaluation among their recent tracks. CLEF began with a Living Labs Track that provided commonly asked queries to users in real time, with different systems and their features substituted as part of the study protocol (Schuth et al., 2015). The CLEF track focused on product search and Web search, while the TREC Open Search Track has focused on academic search.
Schuth, A, Balog, K, et al. (2015). Overview of the Living Labs for Information Retrieval Evaluation (LL4IR) CLEF Lab 2015. CLEF Proceedings 2015

Last updated May 14, 2017