Information Retrieval: A Health and Biomedical Perspective, Third Edition

William Hersh, M.D.

Chapter 4 Update

This update contains all new references cited in the author's OHSU BMI 514/614 course for Chapter 4.

Recent updates describe the National Center for Biomedical Ontology (NCBO) and its main project, the repository of biomedical ontologies called BioPortal (Whetzel et al., 2011; Whetzel, 2013).

An important ontology is the Human Phenotype Ontology, which aims to model all human phenotypes (Köhler et al., 2017).
Whetzel, PL, Noy, NF, et al. (2011). BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications. Nucleic Acids Research. 39: W541-W545.
Whetzel, PL (2013). NCBO technology: powering semantically aware applications. Journal of Biomedical Semantics. 15(4 Suppl 1): S8.
Köhler, S, Vasilevsky, NA, et al. (2017). The Human Phenotype Ontology in 2017. Nucleic Acids Research. 45: D865-D876.

Gene naming is still a challenge, especially with Microsoft Excel, whose automatic conversion of gene symbols to dates and floating-point numbers has introduced errors into published gene lists. Ziemann et al. (2016) found erroneous gene name conversions in about one-fifth of papers with supplementary Excel gene lists.
Ziemann, M, Eren, Y, et al. (2016). Gene name errors are widespread in the scientific literature. Genome Biology. 17: 177.
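One way to screen a gene list for this kind of damage is to look for values that resemble Excel's converted forms. The following is a minimal sketch, not the method used by Ziemann et al.; the function name and the two patterns are assumptions chosen for illustration and will not catch every case.

```python
import re

# Month abbreviations Excel produces when it reads a gene symbol as a date,
# e.g. SEPT2 -> "2-Sep", MARCH1 -> "1-Mar".
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
_DATE_LIKE = re.compile(rf"^\d{{1,2}}-({_MONTHS})$", re.IGNORECASE)

# Scientific notation such as "2.31E+13", the displayed form of an
# identifier like "2310009E13" after it is read as a floating-point number.
_FLOAT_LIKE = re.compile(r"^\d\.\d+E[+-]\d+$", re.IGNORECASE)


def looks_excel_mangled(value: str) -> bool:
    """Return True if value resembles a date or float conversion (hypothetical helper)."""
    value = value.strip()
    return bool(_DATE_LIKE.match(value) or _FLOAT_LIKE.match(value))


# Illustrative column of gene identifiers, some already converted by Excel.
gene_column = ["TP53", "2-Sep", "BRCA1", "1-Mar", "2.31E+13"]
suspect = [g for g in gene_column if looks_excel_mangled(g)]
print(suspect)
```

A check like this only flags candidates; recovering the original symbol still requires the source data, since "2-Sep" could have come from more than one symbol.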
The MeSH vocabulary continues to evolve.

According to the latest fact sheet, the 2016 version of MeSH had 27,883 descriptors (headings) with 87,883 entry terms. MeSH also contains more than 232,000 Supplementary Concept Records (SCRs) that consist of additional chemicals, diseases, and drug protocols.

In 2006, MeSH was expanded from 15 to 16 trees, with the Publication Characteristics (V) tree added to account for the growing number of characteristics of publications (Nahin, 2005). (I missed this when writing the third edition of the book in 2009.) The 16 trees are now:
  • Anatomy [A]
  • Organisms [B]
  • Diseases [C]
  • Chemicals and Drugs [D]
  • Analytical, Diagnostic and Therapeutic Techniques, and Equipment [E]
  • Psychiatry and Psychology [F]
  • Phenomena and Processes [G]
  • Disciplines and Occupations [H]
  • Anthropology, Education, Sociology, and Social Phenomena [I]
  • Technology, Industry, and Agriculture [J]
  • Humanities [K]
  • Information Science [L]
  • Named Groups [M]
  • Health Care [N]
  • Publication Characteristics [V]
  • Geographicals [Z]
Nahin, AM (2005). PubMed® Notes for 2006. NLM Technical Bulletin.
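Because a MeSH tree number begins with the letter of its tree, the 16 trees listed above can be used as a simple lookup table. The sketch below assumes that convention; the function name and the sample tree numbers are illustrative only.

```python
# The 16 top-level MeSH trees, keyed by their category letter.
MESH_TREES = {
    "A": "Anatomy",
    "B": "Organisms",
    "C": "Diseases",
    "D": "Chemicals and Drugs",
    "E": "Analytical, Diagnostic and Therapeutic Techniques, and Equipment",
    "F": "Psychiatry and Psychology",
    "G": "Phenomena and Processes",
    "H": "Disciplines and Occupations",
    "I": "Anthropology, Education, Sociology, and Social Phenomena",
    "J": "Technology, Industry, and Agriculture",
    "K": "Humanities",
    "L": "Information Science",
    "M": "Named Groups",
    "N": "Health Care",
    "V": "Publication Characteristics",
    "Z": "Geographicals",
}


def mesh_tree_name(tree_number: str) -> str:
    """Map a tree number such as 'C04.557' to the name of its top-level tree."""
    letter = tree_number.strip()[:1].upper()
    return MESH_TREES.get(letter, "Unknown")


print(mesh_tree_name("C04.557"))  # sample tree number, used for illustration
```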
The Gene Ontology (GO) continues to grow, with over 40,000 concepts in its three ontologies:
  • Biological Process
  • Cellular Component
  • Molecular Function
There has also been growth of the GO evidence codes, which are now categorized:
  • Experimental Evidence Codes
    • EXP: Inferred from Experiment
    • IDA: Inferred from Direct Assay
    • IPI: Inferred from Physical Interaction
    • IMP: Inferred from Mutant Phenotype
    • IGI: Inferred from Genetic Interaction
    • IEP: Inferred from Expression Pattern
  • Computational Analysis Evidence Codes
    • ISS: Inferred from Sequence or Structural Similarity
    • ISO: Inferred from Sequence Orthology
    • ISA: Inferred from Sequence Alignment
    • ISM: Inferred from Sequence Model
    • IGC: Inferred from Genomic Context
    • IBA: Inferred from Biological aspect of Ancestor
    • IBD: Inferred from Biological aspect of Descendant
    • IKR: Inferred from Key Residues
    • IRD: Inferred from Rapid Divergence
    • RCA: Inferred from Reviewed Computational Analysis
  • Author Statement Evidence Codes
    • TAS: Traceable Author Statement
    • NAS: Non-traceable Author Statement
  • Curator Statement Evidence Codes
    • IC: Inferred by Curator
    • ND: No biological Data available
  • Automatically-assigned Evidence Codes
    • IEA: Inferred from Electronic Annotation
  • Obsolete Evidence Codes
    • NR: Not Recorded
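The categories above lend themselves to filtering annotations by the strength of their evidence. The following sketch groups the codes as listed and keeps only experimentally supported annotations; the annotation tuples, gene names, and GO identifiers in it are hypothetical placeholders, not real data.

```python
# Evidence codes grouped by the categories listed above.
EVIDENCE_CATEGORIES = {
    "Experimental": ["EXP", "IDA", "IPI", "IMP", "IGI", "IEP"],
    "Computational Analysis": ["ISS", "ISO", "ISA", "ISM", "IGC",
                               "IBA", "IBD", "IKR", "IRD", "RCA"],
    "Author Statement": ["TAS", "NAS"],
    "Curator Statement": ["IC", "ND"],
    "Automatically-assigned": ["IEA"],
    "Obsolete": ["NR"],
}

# Invert the grouping so each code maps to its category.
CODE_TO_CATEGORY = {
    code: category
    for category, codes in EVIDENCE_CATEGORIES.items()
    for code in codes
}

# Hypothetical annotations: (gene, GO term identifier, evidence code).
annotations = [
    ("GENE_A", "GO:0000001", "IDA"),
    ("GENE_B", "GO:0000002", "IEA"),
    ("GENE_C", "GO:0000003", "IMP"),
    ("GENE_D", "GO:0000004", "TAS"),
]

experimental = [
    a for a in annotations if CODE_TO_CATEGORY.get(a[2]) == "Experimental"
]
print(experimental)
```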

A questionnaire about the use of the UMLS by informatics researchers had responses from 70 users (Chen et al., 2007). The two major intended uses were access to source terminologies (75%) and mapping among source terminologies (44%). The most common reported uses were:
  • Terminology research (31%)
  • Information retrieval (16%)
  • Terminology translation (12%)
Others reported that the UMLS was used as a terminology itself (77%) and stated that they wanted NLM to develop a unified hierarchy and derive a terminology (73%).
Chen, Y., Perl, Y., et al. (2007). Analysis of a study of the users, uses, and future agenda of the UMLS. Journal of the American Medical Informatics Association, 14: 221-231.
The NCI Thesaurus Web site allows downloading, searching, and browsing. There is also an NCI Metathesaurus that includes about 75 other cancer-related terminologies.

Another controlled terminology system is an ontology aimed at simple manual indexing for the Web. Developed by the major search engine companies (Google, Microsoft, and Yahoo), it is called Schema.org and is supported by the Schema.org Community Group. The schemas are designed to be "microdata" that can be used to index digital content, such as Web pages, and they can even be included in the HTML of the Web page itself. The schemas consist of a collection of "types," each of which is associated with a set of "properties." The types are arranged in a hierarchy. The core vocabulary currently consists of nearly 600 types, over 800 properties, and over 100 enumeration values for the properties. The community has also developed a process for "extensions" to the basic schemas, which can be "hosted" as part of the project or "external" and maintained by outside organizations. One important example of an extension is MedicalEntity, which is related to health and the practice of medicine. As noted on its Web page, this schema "is not intended to define or codify a new controlled medical vocabulary, but instead to complement existing vocabularies and ontologies. As a schema, its focus is on surfacing the existence of and relationships between entities described in content; the specific convention(s) used to name and/or code entities are outside of the scope of this schema. The schema does provide a way to annotate entities with codes that refer to existing controlled medical vocabularies (such as MeSH, SNOMED, ICD, RxNorm, UMLS, etc.) when they are available."
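To make the idea of types, properties, and vocabulary codes concrete, the sketch below builds a Schema.org description of a medical condition and annotates it with a MeSH code. It is an illustration under assumptions: it uses the JSON-LD syntax rather than microdata attributes, and the condition and MeSH identifier are sample values chosen for the example.

```python
import json

# A MedicalCondition (a subtype of MedicalEntity) annotated with a code
# from an existing controlled vocabulary via the MedicalCode type.
condition = {
    "@context": "https://schema.org",
    "@type": "MedicalCondition",
    "name": "Myocardial Infarction",
    "code": {
        "@type": "MedicalCode",
        "codingSystem": "MeSH",
        "codeValue": "D009203",  # sample value for illustration
    },
}

json_ld = json.dumps(condition, indent=2)

# The markup can be embedded directly in the HTML of a Web page.
html_snippet = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(html_snippet)
```

The schema itself says nothing about which vocabulary is correct for a given entity; it only provides the slot in which a code from MeSH, SNOMED, or another system can be recorded.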

Author names continue to be a challenge for bibliographic and other databases, especially as others establish linkages and metrics based on them. This is becoming even more problematic with the increasing number and productivity of Chinese authors, who tend to have short and simple names (Qiu, 2008). A number of different systems have been proposed for author identifiers (Bourne and Fink, 2008; Enserink, 2009; Fenner, 2011), but an emerging standard is the Open Researcher and Contributor ID (ORCID). Over three million scientific authors have signed up, and many journals now require authors to supply their ORCIDs when papers are submitted for publication. An author's ORCID can be used in a URL that links to a Web page listing publications and other information.
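One practical advantage of an identifier over a name is that it can be checked mechanically. An ORCID consists of 16 characters whose last character is a check digit computed with the ISO 7064 MOD 11-2 algorithm. The sketch below implements that check; the function names are assumptions, and the identifier in the example is the sample ORCID used in ORCID's own documentation for a fictitious researcher.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the hyphenated or compact ORCID has a correct check digit."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```

A valid check digit shows only that the identifier is well formed, not that it has been registered to anyone.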

Of course, the computability and probably the reproducibility of science would be enhanced by unique identifiers for all resources that are used and described in research (Vasilevsky et al., 2013).
Qiu, J. (2008). Scientific publishing: identity crisis. Nature, 451: 766-767.
Bourne, P. and Fink, J. (2008). I am not a scientist, I am a number. PLoS Computational Biology, 4(12): 19112480
Enserink, M. (2009). Scientific publishing. Are you ready to become a number? Science, 323: 1662-1664.
Fenner, M (2011). Author identifier overview. Libreas Library Ideas. 18
Vasilevsky, NA, Brush, MH, et al. (2013). On the reproducibility of science: unique identification of research resources in the biomedical literature. PeerJ. 5(1): e148.
While the 15-element Dublin Core Metadata Element Set continues to have wide influence and use, the Dublin Core Metadata Initiative (DCMI) has expanded its efforts to developing application profiles. A Dublin Core Application Profile (DCAP) specifies and describes the metadata used in a particular application. To accomplish this, a profile (quoted from DCMI):
  • Describes what a community wants to accomplish with its application (Functional Requirements)
  • Characterizes the types of things described by the metadata and their relationships (Domain Model)
  • Enumerates the metadata terms to be used and the rules for their use (Description Set Profile and Usage Guidelines)
  • Defines the machine syntax that will be used to encode the data (Syntax Guidelines and Data Formats)
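The third item above, enumerating the terms and the rules for their use, can be illustrated with a small validation routine. This is a sketch under assumptions: the set of required terms is a hypothetical profile invented for the example, not one published by DCMI, and the record is illustrative.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# Hypothetical application profile: terms a community might require.
PROFILE_REQUIRED = {"title", "creator", "date", "subject"}


def check_record(record: dict) -> list[str]:
    """List the ways a record departs from the hypothetical profile."""
    problems = []
    for term in sorted(record):
        if term not in DC_ELEMENTS:
            problems.append(f"unknown term: {term}")
    for term in sorted(PROFILE_REQUIRED - set(record)):
        problems.append(f"missing required term: {term}")
    return problems


# Illustrative record; "date" is deliberately left out.
record = {
    "title": "Information Retrieval: A Health and Biomedical Perspective",
    "creator": "William Hersh",
    "subject": "Information retrieval",
}
print(check_record(record))
```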

An overview of collaborative filtering and related recommender systems is provided by Terveen and Hill (2001). A survey of more recent collaborative filtering techniques is described by Su and Khoshgoftaar (2009). These approaches are in common use on many commercial Web sites, such as Amazon and Netflix. Caplan and Rosenthal (2013) have proposed collaborative filtering approaches for use in identifying unknown clinical cases. Likewise, Wiesner and Pfeifer (2014) note that the growing amount of clinical data captured in EHRs and other sources could lead to recommender systems for patients.
Terveen, L and Hill, W (2001). Beyond Recommender Systems: Helping People Help Each Other. Human-Computer Interaction in the New Millennium. J. Carroll. Reading, MA, Addison-Wesley.
Su, X and Khoshgoftaar, TM (2009). A survey of collaborative filtering techniques. Advances in Artificial Intelligence. 2009: 421425.
Caplan, E and Rosenthal, N (2013). Collaborative Filtering: An Interim Approach To Identifying Clinical Doppelgängers. Health Affairs Blog.
Wiesner, M and Pfeifer, D (2014). Health recommender systems: concepts, requirements, technical basics and challenges. International Journal of Environmental Research and Public Health. 11: 2580-2607.
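As a rough illustration of how collaborative filtering works, the sketch below predicts a user's rating for an unseen item as a similarity-weighted average of other users' ratings. It shows one simple user-based variant using cosine similarity, not any specific method from the papers cited above; the users, items, and ratings are invented.

```python
from math import sqrt

# Invented ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann": {"item1": 5, "item2": 3, "item3": 4},
    "bob": {"item1": 4, "item2": 3, "item4": 5},
    "cat": {"item2": 1, "item3": 2, "item4": 4},
}


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two users' rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def predict(user: str, item: str):
    """Similarity-weighted average of other users' ratings for item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine(ratings[user], their_ratings)
        weighted_sum += sim * their_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


print(predict("ann", "item4"))
```

The same structure applies if "users" are patients and "items" are findings or treatments, which is the kind of clinical use the later papers envision.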
Internet advertising is here to stay (Yuan et al., 2012), so the selection of words and terms to promote content based on willingness to pay needs to be considered a form of manual indexing. It is also becoming prominent in social media, e.g., Facebook and Twitter.
Yuan, S, Abidin, AZ, et al. (2012). Internet Advertising: An Interplay among Advertisers, Online Publishers, Ad Exchanges and Web Users, arXiv 2012.
The Open Directory Project is now defunct, but an online forum persists for those who were involved.

The Health Education Assets Library (HEAL) URL has changed to

A New York Times article describes Google's "schooling" of its search algorithms to keep up with those trying to game them (Lohr, 2011). The challenges of large-scale Web indexing have given rise to new approaches to handling petabyte quantities of data per day. This has led Google to develop MapReduce, which is designed to process such data in parallel, even when portions are not immediately available (Dean and Ghemawat, 2008; Lin and Dyer, 2010). An open-source implementation of this approach is Hadoop.
Lohr, S. (2011). Google Schools Its Algorithm. New York Times. March 5, 2011.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1): 107-113.
Lin, J. and Dyer, C. (2010). Data-Intensive Text Processing with MapReduce. San Rafael, CA. Morgan & Claypool Publishers.
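The MapReduce programming model can be sketched in a few lines, using the standard word-count example. This is a single-process illustration of the map, shuffle, and reduce steps only; it does not use Hadoop or any distributed framework, and the function names and documents are invented for the example.

```python
from collections import defaultdict


def map_phase(doc_id: str, text: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: list):
    """Reduce: combine the values for one key into a single result."""
    return word, sum(counts)


# Invented documents; in a real system each could be mapped on a different machine.
documents = {
    "d1": "mesh indexing of medline",
    "d2": "automated indexing of web pages",
}

mapped = [
    pair
    for doc_id, text in documents.items()
    for pair in map_phase(doc_id, text)
]
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(word_counts)
```

Because each map call sees only one document and each reduce call sees only one key, the work can be spread across many machines and rerun piecemeal when a portion fails or arrives late.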
Another challenge to Web indexing is that we are in the era of "adversarial" IR, where we may want to explicitly not retrieve some content (Castillo and Davison, 2011).
Castillo, C and Davison, BD (2011). Adversarial Web Search. Delft, Netherlands, now Publishers.
While not indexing per se, an interesting approach to summarizing the content of one or more documents is Wordle, which generates word clouds. An app for the SciVerse system has been created that performs a Wordle on one's scientific publications in its comprehensive bibliographic database. The Wordle of my publications is not terribly surprising.
A more recent approach to learning object metadata (and much simpler than others) is the Learning Resource Metadata Initiative (LRMI), which extends Schema.org through a collection of properties that describe educational resources. LRMI is now maintained by DCMI. LRMI predominantly uses properties that it proposed to Schema.org to describe the educational characteristics of learning resources; for some of the properties, it uses LRMI-created types. Version 1.1 of the LRMI specification has been accepted into Schema.org.

MedBiquitous still provides indexing of health professional education learning objects but is also involved in many more standards, such as tracking for continuing medical education (CME) and management of learning competencies.

Last updated April 3, 2017