Information Retrieval: A Health and Biomedical Perspective, Third Edition

William Hersh, M.D.

Chapter 3 Update

This update contains all new references cited in the author's OHSU BMI 514/614 course for Chapter 3.

Note that the names, content, and URLs of Web sites constantly change, especially in health and biomedicine. This update does not update every last site mentioned in the textbook, but focuses on the more important ones.
A good overview about metadata has recently been updated.
Riley, J (2017). Understanding Metadata: What Is Metadata, and What Is It For? Baltimore, MD, National Information Standards Organization.
The 2015 MEDLINE/PubMed baseline contains over 23 million citations, with an additional 4 million citations in PubMed but not in MEDLINE.
NLM continues to revamp all of its databases (including the bibliographic ones), with former subject-specific databases (e.g., AIDSLINE, Cancerlit) now "subsets" within MEDLINE/PubMed. NLM has also retired the Gateway interface, steering users instead to its other interfaces. One of those interfaces is GQuery, which is sometimes also called Entrez and provides a global entry point to all NLM databases.

A description of all the NLM databases is published each year in the annual Database Issue of the journal, Nucleic Acids Research (2017).

NLM is also no longer adding scientific meeting abstracts from HIV/AIDS or cancer to its databases. It is, however, still adding meeting papers from informatics conferences, such as AMIA.

Anonymous (2017). Database Resources of the National Center for Biotechnology Information. Nucleic Acids Research. 45: D12-D17.
PubMed Health is a new NLM resource that links to reviews of comparative effectiveness research. Its database includes content from sources such as:
  • Agency for Healthcare Research and Quality (US) (AHRQ)
  • The Cochrane Collaboration (CC)
  • National Institute for Health and Clinical Excellence guidelines program (NICE)
  • Oregon Health and Science University's Drug Effectiveness Review Project (DERP)
  • Department of Veterans Affairs' Evidence-based Synthesis Program from the Veterans Health Administration R&D (VA ESP)
Another Web feed standard is Atom, although both it and RSS have been declining in use.
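
As a minimal sketch of how a program might read an Atom feed, the following uses only the Python standard library. The feed content, titles, and example.org links are invented for illustration and do not come from any real journal or NLM feed; only the Atom namespace is taken from the standard itself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Atom syndication format.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed, invented purely for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Journal: New Articles</title>
  <entry>
    <title>First example article</title>
    <link href="https://example.org/articles/1"/>
    <updated>2017-03-01T00:00:00Z</updated>
  </entry>
  <entry>
    <title>Second example article</title>
    <link href="https://example.org/articles/2"/>
    <updated>2017-03-08T00:00:00Z</updated>
  </entry>
</feed>
"""


def parse_atom_entries(feed_xml):
    """Return a list of dicts (title, link, updated), one per Atom entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "link": link.get("href") if link is not None else None,
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
        })
    return entries


if __name__ == "__main__":
    for item in parse_atom_entries(SAMPLE_FEED):
        print(item["updated"], item["title"], item["link"])
```

A real feed reader would fetch the XML over HTTP and handle optional elements such as summaries and authors, which this sketch omits.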

PubMed Central continues to grow, although only about 1900 journals are full participants. Content can be viewed in several formats:
  • Classic
  • PDF
  • PubReader – optimized for Web reading
  • ePub format

Papers from the AMIA Annual Symposium Proceedings (and the Symposium for Computer Applications in Medical Care, or SCAMC, before it) are all available now in PubMed Central.
Other informatics-related journal content available in full text in PubMed Central includes:
  • Bioinformatics
  • Journal of the American Medical Informatics Association (JAMIA)
  • Journal of the Medical Library Association (JMLA)
  • Perspectives in Health Information Management
  • BMC journals
  • PLoS journals

A growing number of major journals are scanning their archives back to their inception, such as JAMA to 1883 and the BMJ to 1840 (Delamothe, 2009). The AMA has not only revamped its journal sites but also changed the names of some journals to include the JAMA moniker, e.g., Archives of Internal Medicine became JAMA Internal Medicine in 2013 (Winker et al., 2012).
Delamothe, T (2009). The new BMJ online archive. British Medical Journal. 338: 1025-1026.
Winker, MA, Herron, M, et al. (2012). The JAMA Network Website: today's content on the future of medical publishing. Journal of the American Medical Association. 307: 2321.
The NLM Bookshelf continues to grow, and now has over 4500 items. The books are presented in a variety of formats, including PubReader, which is also used for PubMed Central articles.

Of note, the famous online genetics textbook, Online Mendelian Inheritance in Man (OMIM), is no longer maintained by the NLM. Although NLM Entrez still allows searching of OMIM, results are passed off to the new OMIM site. In addition, OMIM does not link back to NCBI databases.
The text of the online encyclopedia Wikipedia can be downloaded. This resource is also making efforts to improve the quality of its health-related content (Heilman, 2013).

Heilman, J (2013). Online encyclopedia provides free health info for all. Bulletin of the World Health Organization. 91: 8-9.
A number of reports from early IR researchers have been scanned and made available as part of the SIGIR Museum.
Some consumer health sites have gone away (most notably, Intelihealth and Medpedia), while other new ones have emerged, such as the site from the Mayo Clinic.
Some of the URLs for clinical practice guidelines have changed, including those from the American College of Cardiology, the American College of Physicians, the American Academy of Pediatrics, the Institute for Clinical Systems Improvement, and the International Diabetes Federation. The guidelines from University of California San Francisco no longer seem to be available.
Among the many blogs is my own, The Informatics Professor.
InfoPOEMS and InfoRetriever, along with the Cochrane Database of Systematic Reviews, are now part of Essential Evidence Plus. Other evidence-based resources that provide access to varying types of content include:
  • JAMAevidence - from the AMA
  • ACCESSSS - from McMaster University
All of these may well be classified as aggregations.
The BrighamRad collection of teaching files no longer appears to be available. Fortunately a number of other radiology teaching file collections are available:
  • is a resource for clinicians that allows posting of anonymized cases that can include one or more images
  • Lieberman's eRadiology provides a wealth of tutorials and teaching cases
  • Medpix is a collection of medical images, teaching cases, and clinical topics that combines images and textual metadata including over 19,000 patient case scenarios and nearly 54,000 images

A new image resource is Viziometrics, which contains diagrams, visualizations, and photographs from scientific publications (Lee, 2016).

Lee, P, West, JD, et al. (2016). Viziometrics: Analyzing Visual Information in the Scientific Literature. arXiv.
A new and growing type of annotated content is clinical decision support (CDS) for use in the electronic health record (EHR), including decision rules and order sets. Some providers are commercial companies, such as Zynx and Thomson Reuters, as well as EHR vendors themselves.

Another CDS resource, beginning initially as a dermatology collection and then expanding to other images, is VisualDX, which also has a mobile app version.
An excellent source of information for omics and related data is the annual Database Issue of the journal, Nucleic Acids Research (Galperin, 2017). A prominent article in each year's issue is an overview of the database resources from the NLM National Center for Biotechnology Information (Anonymous, 2017). The NLM continues to evolve and improve its genomics resources in response to new technologies, data types, and usability concerns. A key feature is linkage across databases.

Another information source for genomics is Gene Wiki, which is an effort to annotate the human genome within Wikipedia (Hoffmann, 2008; Huss et al., 2008).

One of the most active areas of current work is discovering the clinical effects (phenotype) of genomic variation (genotype). To support this work, a new resource, ClinVar, has been developed (Landrum, 2016).
Galperin, MY, Fernández-Suárez, XM, et al. (2017). The 24th annual Nucleic Acids Research database issue: a look back and upcoming changes. Nucleic Acids Research. 45: D1-D11.
Anonymous (2017). Database Resources of the National Center for Biotechnology Information. Nucleic Acids Research. 45: D12-D17.
Hoffmann, R (2008). A wiki for the life sciences where authorship matters. Nature Genetics. 40: 1047-1051.
Huss, JW, Orozco, C, et al. (2008). A gene wiki for community annotation of gene function. PLoS Biology. 6(7): e175.
Landrum, MJ, Lee, JM, et al. (2016). ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research. 44: D862-D868.

The ClinicalTrials.gov database continues to evolve. Starting from a catalog of clinical trials sponsored by NIH, it has evolved to become a trials registration system (Zarin and Tse, 2013; Zarin et al., 2017), and now a means to report results of trials. A new rule expanding the legal mandate for sponsors and others responsible for certain clinical trials of FDA-regulated drug, biologic, and device products to register their studies and report summary results information to ClinicalTrials.gov was announced in 2016 (Zarin et al., 2016). A next step under consideration is the inclusion of individual patient data (Zarin and Tse, 2016).

Zarin, DA and Tse, T (2013). Trust but verify: trial registration and determining fidelity to the protocol. Annals of Internal Medicine. 159: 65-67.
Zarin, DA, Tse, T, et al. (2017). Update on trial registration 11 years after the ICMJE policy was established. New England Journal of Medicine. 376: 383-391.
Zarin, DA, Tse, T, et al. (2016). Trial reporting in ClinicalTrials.gov - the final rule. New England Journal of Medicine. 375: 1998-2004.
Zarin, DA and Tse, T (2016). Sharing individual participant data (IPD) within the context of the trial reporting system (TRS). PLoS Medicine. 13(1): e1001946.
Many believe that the next step in data transparency and utility in clinical trials is for researchers to share their raw data. High-impact clinical journals often do not require research data to be made available, and when they do, investigators rarely adhere to the requirement (Alsheikh-Ali et al., 2011). It is argued that clinical trial data are a "public good" (Rodwin and Abramson, 2012) and that we must usher in a new era of "open science through data sharing" (Ross and Krumholz, 2013). Eichler et al. (2012) examine the arguments for and against this notion, expressing concern about patient privacy and data being "vulnerable to distortion." Other issues have been raised by the International Consortium of Investigators for Fairness in Trial Data Sharing, who express concern over misleading or inaccurate analyses as well as efforts aimed at discrediting or undermining the original research (Anonymous, 2016). They also express concern about the costs, given that over 27,000 RCTs are performed each year. As such, this group calls for an embargo on reuse of data of two years, plus another half-year for each year of the length of the RCT.
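
The embargo proposal is simple arithmetic, and a short sketch makes it concrete. The function name and the choice to measure trial length in years (allowing fractions) are assumptions made for illustration; the consortium's statement does not specify how partial years would be counted.

```python
def embargo_years(trial_length_years):
    """Proposed embargo on data reuse: two years plus half a year
    for each year of the trial's length.

    How fractional years of trial length are treated is an assumption
    here (they are simply scaled by one half).
    """
    if trial_length_years < 0:
        raise ValueError("trial length cannot be negative")
    return 2.0 + 0.5 * trial_length_years


if __name__ == "__main__":
    for length in (1, 3, 5):
        print(f"{length}-year RCT: embargo of {embargo_years(length)} years")
```

Under this reading, a three-year RCT would carry an embargo of 2 + 0.5 × 3 = 3.5 years.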

If datasets from clinical trials and other research are to be made available, what kinds of databases should be designed and populated? An effort funded by the US NIH has been the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) (Ohno-Machado, 2015). bioCADDIE can be accessed via the DataMed search interface and is based on a data tag suite (DATS) for datasets (Sansone, 2017).

This and related efforts have built upon the FAIR principles of Findability, Accessibility, Interoperability and Reusability (Wilkinson, 2016).

Another source of research data sets is re3data.org, whose metadata schema has been described by Rücknagel et al. (2015).
Alsheikh-Ali, AA, Qureshi, W, et al. (2011). Public availability of published research data in high-impact journals. PLoS ONE. 6(9): e24357.
Rodwin, MA and Abramson, JD (2012). Clinical trial data as a public good. Journal of the American Medical Association. 308: 871-872.
Ross, JS and Krumholz, HM (2013). Ushering in a new era of open science through data sharing: the wall must come down. Journal of the American Medical Association. 309: 1355-1356.
Eichler, HG, Abadie, E, et al. (2012). Open clinical trial data for all? A view from regulators. PLoS Medicine. 9(4): e1001202.

Anonymous (2016). Toward fairness in data sharing. New England Journal of Medicine. 375: 405-407.
Ohno-Machado, L, Alter, G, et al. (2015). bioCADDIE white paper - Data Discovery Index. La Jolla, CA, University of California San Diego.
Sansone, SA, Gonzalez-Beltran, A, et al. (2017). DATS: the data tag suite to enable discoverability of datasets. bioRxiv.
Wilkinson, MD, Dumontier, M, et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 3: 160018.
Rücknagel, J, Vierkant, P, et al. (2015). Metadata Schema for the Description of Research Data Repositories: version 3.0. Potsdam, Germany, Helmholtz Centre Potsdam.
The US Food and Drug Administration (FDA) has improved its information availability. Two resources of note include:
  • Medwatch - reporting system for adverse events
  • Drugs@FDA - database of FDA-approved products
A comparison of results reporting for new drug approval trials from ClinicalTrials.gov and Drugs@FDA found the results to be congruent, with the former providing more detail on adverse events (Schwartz, 2016).
Schwartz, LM, Woloshin, S, et al. (2016). ClinicalTrials.gov and Drugs@FDA: a comparison of results reporting for new drug approval trials. Annals of Internal Medicine. 165: 421-430.
The CRISP database of grants funded by NIH has been retired and replaced by the NIH RePORTER system.
Another interesting type of data (or representation of data) is cartograms, with a good hub source being Cartogram Central.
More and more publishers are aggregating their content into large collections that can be marketed as a single entity. These include the Scitable resource of Nature Publishing, Sciverse of Elsevier, and several of the resources described in section 3.2 above. Sciverse has an application programming interface (API) that allows others to write interactive apps. The NLM also has eUtilities with similar functionality.
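
As a rough sketch of what programmatic access through the NLM E-utilities can look like, the following builds an ESearch request URL for PubMed and parses a JSON response. The base URL and the db, term, retmax, and retmode parameters are part of the public E-utilities interface; the function names are illustrative, and the sample response uses placeholder identifiers rather than real search results. No network request is made.

```python
import json
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Build an ESearch URL that asks for results in JSON format."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


# Placeholder response: the layout follows ESearch JSON output, but the
# count and identifiers are invented for illustration.
SAMPLE_RESPONSE = """
{"esearchresult": {"count": "2", "retmax": "2", "retstart": "0",
                   "idlist": ["00000001", "00000002"]}}
"""


def extract_ids(response_text):
    """Return (total hit count, list of record identifiers)."""
    result = json.loads(response_text)["esearchresult"]
    return int(result["count"]), result["idlist"]


if __name__ == "__main__":
    print(build_esearch_url("information retrieval", retmax=5))
    print(extract_ids(SAMPLE_RESPONSE))
```

A working client would send the URL with an HTTP library, follow NCBI's usage guidelines on request rates, and typically pass the returned identifiers to a second call such as ESummary or EFetch to retrieve the records themselves.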
The URL for DrugBank has changed.
Another model organism database is the Zebrafish Information Network (ZFIN). I have had a chance to visit the zebrafish colony in Eugene, OR!
The future? NIH Commons aims to be a shared space for all digital objects of biomedical research.

Last updated March 12, 2017