Information Retrieval: A Health and Biomedical Perspective, Third Edition

William Hersh, M.D.

Chapter 1 Update

This update contains all new references cited in the author's OHSU BMI 514/614 course for Chapter 1.

In Stage 2 of achieving meaningful use objectives, one of the menu (choose three of six) objectives is to use electronic notes for 30% of all patients (Metzger, 2012). In order to be able to count a document as an electronic note, there must be the ability to search within the text of the note (HHS, 2012).
Anonymous (2012). Test Procedure for §170.314(a)(9) Electronic notes. Washington, DC, Department of Health and Human Services.
Metzger, J and Rhoads, J (2012). Summary of Key Provisions in Final Rule for Stage 2 HITECH Meaningful Use. Falls Church, VA, Computer Sciences Corp.
Google Trends (formerly Zeitgeist) - what people are searching
Size of digital universe - the latest report by EMC estimates 4.4 zettabytes in 2013, growing to 44 zettabytes by 2020. Healthcare data are growing even faster, estimated to reach 2.2 zettabytes by 2020. Hilbert and López (2011) estimated a total of 176 exabytes of information in 2007, with the amount of analog information peaking around 2000 and all subsequent growth occurring with digital data. Short et al. (2011) looked at the amount of information believed to be processed on servers in 2008, estimating 9.57 zettabytes, which works out to about 12 gigabytes per day for the average worker.
Anonymous (2014). The Digital Universe of Opportunities. Hopkinton, MA, EMC Corp.
Anonymous (2014). The Digital Universe Driving Data Growth in Healthcare. Hopkinton, MA, EMC Corp.
Hilbert, M. and López, P. (2011). The world's technological capacity to store, communicate, and compute information. Science, 332: 60-65.
Short, J, Bohn, RE, et al. (2011). How Much Information? 2010 Report on Enterprise Server Information. San Diego, CA, University of California, San Diego.
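As a rough check on the EMC figures above, the implied yearly growth rate can be computed from the two estimates. This is a minimal sketch that assumes smooth compound growth between 2013 and 2020, which the reports themselves do not state; the function name is hypothetical.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Return the constant yearly rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# EMC estimates quoted above: 4.4 zettabytes in 2013, 44 zettabytes by 2020.
rate = compound_annual_growth_rate(4.4, 44.0, 2020 - 2013)
print(f"Implied growth: {rate:.1%} per year")
```

Under that assumption, a tenfold increase over seven years corresponds to growth of roughly 39% per year, or a doubling a little more than every two years.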
In addition to information overload and data smog, there is now "information chaos" (Beasley, 2011).
Beasley, J., T. Wetterneck, J. Temte, J. Lapin, P. Smith, A. Rivera-Rodriguez and B. Karsh (2011). Information chaos in primary care: implications for physician performance and patient safety. Journal of the American Board of Family Medicine. 24: 745-751.
Another growing concern is "health information literacy," the need for which has increased in this era of proliferating online health information (Schardt, 2011). It consists of the following skills:
  • Recognize a health information need
  • Identify likely information sources and use them to retrieve relevant information
  • Assess the quality of the information and its applicability to a specific situation
  • Analyze, understand, and use the information to make good health decisions
The US Institute of Medicine has held numerous workshops on health literacy and numeracy, the most recent of which was in 2014 (French, 2014). A growing population of concern for health-related content on the Web is aging baby boomers, who have the technology skills of those who are younger and are increasingly developing the health problems of those who are older (LeRouge, 2014).
Schardt, C (2011). Health information literacy meets evidence-based practice. Journal of the Medical Library Association. 99: 1-2.
French, MG (2014). Health Literacy and Numeracy: Workshop Summary. Washington, DC, National Academies Press.
LeRouge, C, VanSlyke, C, et al. (2014). Baby boomers’ adoption of consumer health technologies: survey on readiness and barriers. Journal of Medical Internet Research. 16(9): e200.
After 32 years of leadership from Donald A.B. Lindberg, MD, the National Library of Medicine named Patricia Flatley Brennan, RN, PhD as its fourth Director in 2016.
Newer books since 3rd edition
Aggarwal, C. and Zhai, C., eds. (2010). Mining Text Data. Boston, MA. Springer.
Baeza-Yates, R. and Ribeiro-Neto, B. (2011). Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition). Reading, MA. Addison-Wesley.
Buttcher, S., Clarke, C., et al. (2010). Information Retrieval: Implementing and Evaluating Search Engines. Cambridge, MA. MIT Press.
Chowdhury, G. (2010). Introduction to Modern Information Retrieval, 3rd Edition. New York, NY, Neal-Schuman Publishers.
Connaway, LS, Radford, ML, et al. (2016). Research Methods in Library and Information Science. Santa Barbara, CA, ABC-CLIO.
Croft, W., Metzler, D., et al. (2009). Search Engines: Information Retrieval in Practice. Boston, MA. Addison-Wesley.
Goker, A. and Davies, J. (2009). Information Retrieval: Searching in the 21st Century. Hoboken, NJ. Wiley.
Gormley, C and Tong, Z (2015). Elasticsearch: The Definitive Guide. Sebastopol, CA, O'Reilly & Associates.
Hearst, M. (2009). Search User Interfaces. Cambridge, England. Cambridge University Press.
Ingersoll, G., T. Morton and A. Farris (2013). Taming Text - How to Find, Organize, and Manipulate It. Shelter Island, NY, Manning Publications.
Manning, C., P. Raghavan and H. Schutze (2008). Introduction to Information Retrieval. Cambridge, England, Cambridge University Press.
McCandless, M., E. Hatcher and O. Gospodnetic (2010). Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Greenwich, CT, Manning Publications.
Morville, P. and Callender, J. (2010). Search Patterns: Design for Discovery. Sebastopol, CA. O'Reilly & Associates.
Müller, H., Clough, P., et al., eds. (2010). ImageCLEF: Experimental Evaluation in Visual Information Retrieval. Heidelberg, Germany. Springer.
Perkins, J. (2010). Python Text Processing with NLTK 2.0 Cookbook. Birmingham, England, Packt Publishing.
Rubin, R (2016). Foundations of Library and Information Science, Fourth Edition. New York, NY, Neal-Schuman Publishers.
Shatkay, H. and M. Craven (2012). Mining the Biomedical Literature. Cambridge, Massachusetts, MIT Press.
Turnbull, D and Berryman, J (2016). Relevant Search: With Applications for Solr and Elasticsearch. Greenwich, CT, Manning Publications.
White, T. (2012). Hadoop - The Definitive Guide. Sebastopol, CA, O'Reilly Media.
Zhai, C and Massung, S (2016). Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. New York, NY, Association for Computing Machinery.
The publisher Morgan-Claypool has developed an extensive series of Synthesis Lectures on Information Concepts, Retrieval, and Services. The series consists of 50-100 page publications on a variety of topics pertaining to information science and applications of technology to information discovery, production, distribution, and management.
Fox, E., M. Goncalves and R. Shen (2012). Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach. San Rafael, CA, Morgan & Claypool.
Fox, E., da Silva Torres, R. (2014). Digital Library Technologies: Complex Objects, Annotation, Ontologies, Classification, Extraction, and Security. San Rafael, CA, Morgan & Claypool.
Harman, D. (2011). Information Retrieval Evaluation. San Rafael, CA, Morgan & Claypool.
Lin, J. and C. Dyer (2010). Data-Intensive Text Processing with MapReduce. San Rafael, CA, Morgan & Claypool.
Marchionini, G. (2010). Information Concepts: From Books to Cyberspace Identities. San Rafael, CA, Morgan & Claypool.
Roelleke, T. (2013). Information Retrieval Models: Foundations and Relationships. San Rafael, CA, Morgan & Claypool.
Saracevic, T (2016). The Notion of Relevance in Information Science: Everybody knows what relevance is. But, what is it really?. San Rafael, CA, Morgan & Claypool.
Another publisher, Now Publishers, has created an additional series, Foundations and Trends in Information Retrieval.

The Lucene search engine continues to achieve widespread use and has been extended to the enterprise (Solr; Grainger, 2014) and in more analytic directions (Elasticsearch; Gormley, 2015). There also continue to be improvements to other open-source search engines as well as the emergence of new ones, with the most definitive source now probably being a Wikipedia page.
Grainger, T and Potter, T (2014). Solr in Action. Greenwich, CT, Manning Publications.
Gormley, C and Tong, Z (2015). Elasticsearch: The Definitive Guide. Sebastopol, CA, O'Reilly & Associates.
Internet users - global and US, measured by these sites (as of June 30, 2016):
  • 3.5-3.7 billion users worldwide - about half of world, with penetration by region
    • North America - 89% (8.7% of all users)
    • Europe - 74%
    • Australia/Oceania - 73%
    • Latin America - 62%
    • Middle East - 57%
    • Asia - 46% (50.2% of all users, due to large population size)
    • Africa - 29%
  • 1.1 billion Web sites
  • 1.3 billion Google searches per day
  • 1.5 billion YouTube videos viewed per day

US Internet (Pew) use (as of early 2017):

  • 88% of adults
  • 99% of age 18-29 down to 64% of age 65+
  • 73% with home broadband
  • 95% with cell phone, 73% with smartphone
  • Other devices
    • E-reader - 22%
    • Tablet - 51%
    • Desktop/laptop computer - 78%

Smartphone users (Statista):

  • 1.8 billion in world in 2015, rising to 2.9 billion by 2020
  • 207 million in US in 2015, rising to 264 million by 2021
Computational/Internet advertising
Yuan, S, Abidin, AZ, et al. (2012). Internet Advertising: An Interplay among Advertisers, Online Publishers, Ad Exchanges and Web Users, arXiv 2012.
New adversarial IR: fake news
Holan, AD (2016). 2016 Lie of the Year: Fake news. St. Petersburg, FL, Politifact.
Davis, W (2016). Fake Or Real? How To Self-Check The News And Get The Facts. Washington, DC, National Public Radio.
Oremus, W (2016). Only You Can Stop the Spread of Fake News. Slate, December 13, 2016.
Silverman, C (2016). This Analysis Shows How Fake Election News Stories Outperformed Real News On Facebook. Buzzfeed News, November 16, 2016.
Search has become essentially ubiquitous for those working in health-related disciplines, even physicians. A recent survey conducted by Google and Manhattan Research describes how much search is used by physicians (Google and Manhattan Research, 2012). Although the survey was conducted online and thus may not be truly representative of all physicians, Manhattan Research claims that those surveyed are representative of the age, gender, region, practice, and specialty setting of all physicians in the US. The survey included 506 US physicians and was conducted during February-March, 2012. Some of the key findings were:
  • Most have multiple devices: 99% with a desktop or laptop, 84% with a smartphone, and 54% with a tablet
  • They spend twice as much time using online resources as print resources
  • Even physicians aged 55+ are heavy users: 80% own a smartphone, 84% use search engines daily, and nine hours per week is spent online for professional purposes
  • Search engine use is a daily activity, with 84% using them daily, an average of six searches done per day, and 94% using Google
  • When looking for clinical or treatment information, about a third click first on sponsored listings from a search
  • About 93% of physicians say they take action based on searching, everything from pursuing more information to sharing with a patient or colleague to changing treatment decisions
  • On smartphones, searching is preferred over mobile apps, as 48% of use time is spent with a search engine, 34% is spent with mobile apps, and 18% is spent going to specific Web sites in a browser or with a bookmark
  • Physicians spend about six hours per week watching online video, with about half of that time spent for professional purposes
Anonymous (2012). From Screen to Script: The Doctor's Digital Path to Treatment. New York, NY, Manhattan Research; Google.
Search is, of course, ubiquitous for most of the rest of the US and almost all of the world. Pew Internet performed surveys about search in general and health in particular but discontinued regular updates around 2013.

By 2012, 73% of all Americans (91% of all Internet users) were using search engines (Purcell, 2012). This more recent study also presents data showing that many search users are troubled by information collected about them, with 65% stating that personalizing search in this manner is a "bad thing" (Purcell, 2012). Searching for health information is a common use of search. About 72% of US adult Internet users (59% of all US adults) have looked for health information in the last year (Fox, 2013). About half the time, the searches were done on behalf of the health of someone else. A smaller but still substantial percentage, 35% of all US adults, have used the Internet to try to diagnose a medical condition that they or someone else has. About 53% of "online diagnosers" talked with a clinician about what they found and 41% had their condition confirmed by a clinician.
Purcell, K, Brenner, J, et al. (2012). Search Engine Use 2012. Washington, DC, Pew Internet & American Life Project.
Fox, S and Duggan, M (2013). Health Online 2013. Washington, DC, Pew Internet & American Life Project.
A previous analysis by Fox (2011) measured the most common types of searches done and the proportion having done them:
  • 66% for specific disease or medical condition
  • 56% for certain medical treatment or procedure
  • 44% for doctors or other health professionals
  • 36% for hospitals or other medical facilities
  • 33% for health insurance, including Medicare or Medicaid
  • 29% for food safety or recalls
  • 24% for drug safety or recalls
  • 22% for environmental or health hazards
Fox has also found that more than half of whites, women, adults providing unpaid care, or adults with higher income or some college education have searched for health information, while less than 50% of African-Americans, Latinos, adults with disability, adults over 65, or adults with high school education or less or lower income have searched for health information.
Fox, S. (2011). Health Topics. Washington, DC, Pew Internet & American Life Project.
Search Engine Optimization (SEO) - overview and periodic table, Google
A recent volume by Harman (2011) updates the state of the art for system-oriented evaluation methods. As test collections get larger (to reflect the growing size of real-world collections), there is a need for better methods to select documents to get the best sampling for relevance judgments. There is also a need for new performance measures that make the best use of incomplete judgments. Yilmaz et al. (2008) have developed some measures, most notably inferred average precision (infAP) and inferred normalized discounted cumulative gain (infNDCG), which "infer" mean average precision and normalized discounted cumulative gain by making use of random sampling for judgments. Of course, this still requires high-quality judgments, since assessor error can have a large impact on results when the actual number of judgments is small (Carterette, 2010).
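To make the underlying measures concrete, the following minimal sketch computes average precision and normalized discounted cumulative gain for a single ranked list when complete judgments are available. It does not implement the inferred variants of Yilmaz et al., which estimate these quantities from a random sample of judgments. The function names, the log2 rank discount, and the toy judgments are assumptions chosen for illustration.

```python
import math


def average_precision(ranked_relevance, total_relevant):
    """Average precision for one topic.

    ranked_relevance: 0/1 judgments of the retrieved documents, in rank order.
    total_relevant: number of relevant documents known for the topic.
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / total_relevant


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, all_gains):
    """DCG of the ranking divided by the DCG of the ideal ordering.

    all_gains: graded gains of every judged document for the topic.
    """
    ideal = sorted(all_gains, reverse=True)[:len(ranked_gains)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example (invented judgments): relevant documents at ranks 1 and 3,
# with three relevant documents known for the topic.
ap = average_precision([1, 0, 1, 0, 0], total_relevant=3)
score = ndcg([3, 2, 3, 0, 1], all_gains=[3, 2, 3, 0, 1])
print(f"AP = {ap:.4f}, NDCG = {score:.4f}")
```

Mean average precision is then the mean of the per-topic average precision values across all topics in the test collection.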

It has always been known that evaluation metrics vary, sometimes widely, across topics in test collections. Fu et al. (2011) demonstrated this with medical searches in a variety of topic areas. A related problem is that authors of papers often report "weak baselines," comparing their systems against results below the known best results for a given test collection. Armstrong et al. (2009) found that many papers assessing systems using Text Retrieval Conference (TREC) data reported baselines below the median results of the TREC conference where the collection was introduced, and that the results attained rarely exceeded the score of the best run from TREC.
Harman, D. (2011). Information Retrieval Evaluation. San Rafael, CA, Morgan & Claypool.
Yilmaz, E., Kanoulas, E., et al. (2008). A simple and efficient sampling method for estimating AP and NDCG. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore. 603-610.
Carterette, B. and I. Soboroff (2010). The effect of assessor error on IR system evaluation. Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), Geneva, Switzerland. 539-546.
Fu, L., Aphinyanaphongs, Y., et al. (2011). A comparison of evaluation metrics for biomedical journals, articles, and websites in terms of sensitivity to topic. Journal of Biomedical Informatics, 44: 587-594.
Armstrong, T., A. Moffat, W. Webber and J. Zobel (2009). Improvements that don't add up: ad-hoc retrieval results since 1998. Proceedings of the Conference on Information and Knowledge Management (CIKM) 2009, Hong Kong, China. 601-610.
The Text Retrieval Conference (TREC) celebrated its 25th anniversary in 2016. TREC is still the original and largest of all information retrieval (IR) challenge evaluations, still taking place on an annual basis. The Webcast of the 25th anniversary conference features a talk, The TREC Bio/Medical Tracks, which describes the various tracks in the biomedical domain at TREC over the years (starting around the 50-minute mark of Part 3).

TREC continues to inspire other work. The Cross-Language Evaluation Forum (CLEF) continues and has been renamed the Conference and Labs of the Evaluation Forum. The Question-Answering Track inspired work at IBM on Watson (Ferrucci, 2010; Ferrucci, 2012).

Ferrucci, D, Brown, E, et al. (2010). Building Watson: an overview of the DeepQA Project. AI Magazine. 31(3): 59-79.
Ferrucci, D, Levas, A, et al. (2012). Watson: beyond Jeopardy! Artificial Intelligence. 199-200: 93-105.
TREC continues to have domain-specific tracks in biomedicine beyond the Genomics Track described in the 3rd edition. TREC added a Medical Records Track in 2011 (Voorhees, 2011). The track was continued in 2012 (Voorhees, 2012), although it was not continued further due to problems with the availability of the data. The task of the TREC 2011-2012 Medical Records Tracks consisted of searching electronic health record (EHR) documents in order to identify patients matching a set of clinical criteria, a use case that might be part of the identification of individuals eligible for a clinical study or trial (Voorhees, 2013). The task’s various topics each represented a different case definition, with the topics varying widely in terms of detail and linguistic complexity. This use case is one of a larger group that represent the “secondary use” of data in EHRs (Safran, 2007) that facilitate clinical research, quality improvement, and other aspects of a health care system that can “learn” from its data and outcomes (Friedman, 2010). It is made possible by the large US government investment in EHR adoption that has occurred since 2009 (DesRoches, 2015; Gold, 2016).

The corpus for the TREC Medical Records Track consisted of a set of 93,552 patient encounter files extracted from an EHR system of the University of Pittsburgh Medical Center. Each encounter file represented a note entered by a clinician or a report in the course of caring for a patient. Each note or report was categorized by type (e.g., History & Physical, Surgical Pathology Report, Radiology Report) or in some cases sub-type (e.g., Angiography).

The encounter files were each associated with one of 17,265 unique patient visits to the hospital or emergency department. Most visits (≈70%) included five or fewer reports; virtually all (≈97%) included fewer than 20. The maximum number of encounters comprising any visit was 415. Each encounter within a visit shared a chief complaint as well as one admission ICD-9 code and a set of discharge ICD-9 codes. The number of discharge ICD-9 codes varied widely from visit to visit; the median number of codes per visit was five, while the maximum was 25. Patients could not be linked across visits, i.e., be identified as having more than one visit, due to the de-identification process applied to the corpus. As such, for the purposes of this task, the “unit of retrieval” was the visit rather than the patient, meaning that the participating systems were to produce a set of matching visits for each topic, with the visit serving as a surrogate for an individual patient meeting the given clinical criteria.
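Because the text resides in individual encounter files while the unit of retrieval is the visit, a system needs some way to turn report-level scores into a visit ranking. One possible approach, sketched below, scores each visit by its best-matching report. This is an illustrative assumption rather than a method prescribed by the track, and the identifiers and scores are invented.

```python
def rank_visits(report_scores, report_to_visit):
    """Rank visits by the highest score among each visit's reports.

    report_scores: dict mapping report ID -> retrieval score for one topic.
    report_to_visit: dict mapping report ID -> visit ID.
    Returns a list of (visit_id, score) pairs, best first.
    """
    best = {}
    for report_id, score in report_scores.items():
        visit_id = report_to_visit[report_id]
        if visit_id not in best or score > best[visit_id]:
            best[visit_id] = score
    # Sort by descending score, breaking ties by visit ID for repeatability.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical report-level scores for one topic.
report_scores = {"r1": 2.1, "r2": 0.4, "r3": 1.7, "r4": 3.0}
report_to_visit = {"r1": "v1", "r2": "v1", "r3": "v2", "r4": "v3"}

visit_ranking = rank_visits(report_scores, report_to_visit)
print(visit_ranking)
```

Other aggregation choices, such as summing report scores or concatenating all reports of a visit into one document before indexing, are equally plausible under the same setup.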

The Clinical Decision Support Track was started in 2014 (Simpson, 2014; Roberts, 2016) and continued for two years (2015, 2016). In some ways more of a traditional IR task, the track had the use case of providing documents that might answer three types of clinical questions about diagnosis, tests, and treatments. The document collection consisted of an open access subset of PubMed Central (PMC). The track used a subset as defined on January 21, 2014, containing 733,138 articles. Images and other supplementary material from PMC were also available, although not included in the basic release of documents. Each of the topics in the 2014 track consisted of a case narrative plus a label designating the basic clinical task to which the topic pertained. The topics were developed by physicians at the National Institutes of Health (NIH), who developed ten topics for each clinical task type. Each topic statement included both a description of the problem and a shorter, more focused summary. The case narratives were used as an idealized medical record since no collections of actual medical records were available.
Voorhees, E. and Tong, R. (2011). Overview of the TREC 2011 Medical Records Track. The Twentieth Text REtrieval Conference Proceedings (TREC 2011), Gaithersburg, MD. National Institute of Standards and Technology.
Voorhees, E. and W. Hersh (2012). Overview of the TREC 2012 Medical Records Track. The Twenty-First Text REtrieval Conference Proceedings (TREC 2012), Gaithersburg, MD, National Institute of Standards and Technology.
Voorhees, EM (2013). The TREC Medical Records Track. Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, Washington, DC. 239-246.
Safran, C., Bloomrosen, M., et al. (2007). Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. Journal of the American Medical Informatics Association, 14: 1-9.
Friedman, C., Wong, A., et al. (2010). Achieving a nationwide learning health system. Science Translational Medicine, 2(57): 57cm29.
DesRoches, CM, Painter, MW, et al. (2015). Health Information Technology in the United States 2015 - Transition to a Post-HITECH World. Princeton, NJ, Robert Wood Johnson Foundation.
Gold, M and McLaughlin, C (2016). Assessing HITECH implementation and lessons: 5 years later. Milbank Quarterly. 94: 654-687.
Simpson, MS, Voorhees, E, et al. (2014). Overview of the TREC 2014 Clinical Decision Support Track. The Twenty-Third Text REtrieval Conference Proceedings (TREC 2014), Gaithersburg, MD. National Institute of Standards and Technology.
Roberts, K, Simpson, M, et al. (2016). State-of-the-art in biomedical literature retrieval for clinical cases: a survey of the TREC 2014 CDS track. Information Retrieval Journal. 19: 113-148.
There is growing interest in retrieval (and challenge evaluations) using highly personal data, such as email, confidential documents, and patient records. This makes the standard approach to challenge evaluations difficult, in particular the distribution of the test collection data. One approach to solving this problem is the notion of Evaluation as a Service (EaaS), where the data is stored in a highly secure site and IR researchers send their experimental systems to the data (Hanbury, 2015; Roegiest, 2016). Of course, a limitation of this approach is that the researchers receive only results, and not the actual data retrieved, for analysis.
Roegiest, A and Cormack, GV (2016). An architecture for privacy-preserving and replicable high-recall retrieval experiments. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, Pisa, Italy. 1085-1088.
Hanbury, A, Müller, H, et al. (2015). Evaluation-as-a-Service: Overview and Outlook, arXiv.
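The EaaS idea described above can be illustrated with a minimal sketch: the evaluation host runs a submitted search function against data the researcher never receives and returns only an aggregate score. Everything here is hypothetical, including the function names, the choice of precision at k as the reported measure, and the toy documents. A real service would also sandbox the submitted code so that it cannot copy the data out.

```python
def evaluate_as_service(search_fn, private_docs, topics, qrels, k=10):
    """Run a submitted system inside the secure site.

    Only aggregate metrics leave this function; no document IDs or text do.
    """
    precisions = []
    for topic_id, query in topics.items():
        ranked_ids = search_fn(query, private_docs)[:k]
        relevant = qrels.get(topic_id, set())
        hits = sum(1 for doc_id in ranked_ids if doc_id in relevant)
        precisions.append(hits / k)
    mean_p = sum(precisions) / len(precisions) if precisions else 0.0
    return {"mean_precision_at_k": mean_p, "k": k}


def toy_search(query, docs):
    """A stand-in for a researcher's system: rank by number of shared terms."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), doc_id)
              for doc_id, text in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for overlap, doc_id in ranked if overlap > 0]


# Data held only by the evaluation host (invented for illustration).
private_docs = {
    "d1": "chest pain and shortness of breath",
    "d2": "routine follow up visit",
    "d3": "pneumonia with chest infiltrate",
}
topics = {"t1": "chest pain"}
qrels = {"t1": {"d1"}}

result = evaluate_as_service(toy_search, private_docs, topics, qrels, k=2)
print(result)
```

The returned dictionary is all the researcher sees, which mirrors the limitation noted above: failure analysis on individual retrieved documents is not possible from the score alone.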
Other biomedical challenge evaluations have been developed as well:
  • Biocreative (Critical Assessment of Information Extraction in Biology) - focused on tasks for annotators of biological literature
  • i2b2 - long-standing yearly challenge focused on natural language processing (NLP) and information extraction, through Uzuner (2015)
  • ImageCLEFmed - in addition to ad hoc medical image retrieval, the tasks included modality classification (e.g., identify magnetic resonance imaging [MRI] or plain film x-ray images) and identification of similar cases based on a given image (e.g., with description and an image of a patient with pneumonia, find other cases of pneumonia).
  • CLEF eHealth - focused on personal health search
  • bioCADDIE - retrieval of metadata for research data sets

Uzuner, O and Stubbs, A (2015). Practical applications for natural language processing in clinical research: The 2014 i2b2/UTHealth shared tasks. Journal of Biomedical Informatics. 58(Suppl): S1-S5.

Last updated February 8, 2017