OHSUMED Test Collection

This test collection was created to assist information retrieval research. It is a clinically-oriented MEDLINE subset, consisting of 348,566 references (out of a total of over 7 million), covering all references from 270 medical journals over a five-year period (1987-1991). The test database is about 400 megabytes in size. A number of fields normally present in the MEDLINE record but not pertinent to content-based information retrieval have been deleted. The only fields retained are the title, abstract, MeSH indexing terms, author, source, and publication type. Since this database is neither up-to-date nor complete, it is useless as a tool for real searchers and only useful for research purposes.

The test collection was built as part of a study assessing the use of MEDLINE by physicians in a clinical setting (1). Novice physicians using MEDLINE generated 106 queries. Before they searched, they were asked to provide a statement of information about their patient as well as their information need. Each query was later replicated by four searchers, two physicians experienced in searching and two medical librarians. The results were assessed for relevance by a different group of physicians, using a three-point scale: definitely, possibly, or not relevant. There were 12,565 unique query-reference pairs. Over 10% of the query-document pairs were judged in duplicate to assess interobserver reliability.

The original test collection was subsequently used in experiments with the SMART retrieval system (2). As would be expected, SMART retrieved a number of references not retrieved by the original searchers. A second round of relevance judgments was done after these experiments. There were 3,575 new query-reference pairs judged, along with an overlap of over 10% to again assess interobserver reliability.

Thus there are now a total of 16,140 query-document pairs that have been judged for relevance. These are in a file (judged), along with each of the relevance judgments done. There are also files that list relevant query-document pairs (drel.i, drel.ui, pdrel.i, and pdrel.ui). In these files, only the original relevance judgment is used.

(Note: There are five queries for which there are no definitely relevant documents, and you may wish to delete these from your experiments. They are being left in the query file for this distribution, because further analysis may uncover relevant documents for them. Some systems, such as SMART, automatically drop queries with no relevant documents from their analysis. The five queries for which no definitely relevant documents exist are 8, 28, 49, 86, and 93.)
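
For those who do wish to drop these five queries, the following is a minimal sketch in Python. The function name is illustrative and not part of the distribution, and the usage line assumes the queries are numbered 1 through 106, which this file does not state explicitly.

```python
# Query numbers listed in the note above as having no definitely
# relevant documents.
NO_DEFINITELY_RELEVANT = {8, 28, 49, 86, 93}


def usable_queries(query_ids, excluded=NO_DEFINITELY_RELEVANT):
    """Return the query numbers to keep, preserving input order."""
    return [q for q in query_ids if q not in excluded]


# Assumption: queries are numbered 1..106.
remaining = usable_queries(range(1, 107))
print(len(remaining))
```

Keeping the excluded set as a parameter makes it easy to restore a query if a later release of the judgments finds relevant documents for it.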

The National Library of Medicine has agreed to make the MEDLINE references in the test database available for experimentation, restricted to the following conditions:
1. The data will not be used in any non-experimental clinical, library, or other setting.
2. Any human users of the data will explicitly be told that the data is incomplete and out-of-date.

There are 13 files that make up the test collection, and each is described below. (For those of you receiving compressed files, you will obtain only seven files. Each of the files 1-5 below is compressed by itself, and has the suffix .tar.Z. All of the files 6-12 are compressed into one file, which is called ohsumed.rest.tar.Z. The final file is this file, readme, which is not compressed.)

Here are the files, their uncompressed size, and a description of their content:

1) ohsumed.87 (60,303,307) -- Contains the MEDLINE documents for the year 1987. The format for each of the MEDLINE document files follows the conventions of the SMART system, with each field defined as below (NLM designator in parentheses):
   .I   sequential identifier
   .U   MEDLINE identifier (UI)
   .M   Human-assigned MeSH terms (MH)
   .T   Title (TI)
   .P   Publication type (PT)
   .W   Abstract (AB)
   .A   Author (AU)
   .S   Source (SO)
(Note: Some references have their abstracts truncated at 250 words, while some have no abstracts at all.)
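
The field layout above can be read with a small parser such as the Python sketch below. It is only an illustration under stated assumptions: that every record begins with a .I line, and that a field's text may follow the tag on the same line or on the lines after it (this file does not spell out which). The function name and the sample record are made up for the example.

```python
import io

# Field tags used in the MEDLINE document files, as listed above.
DOC_TAGS = frozenset({".I", ".U", ".M", ".T", ".P", ".W", ".A", ".S"})


def parse_records(lines, tags=DOC_TAGS):
    """Yield one dict per record, keyed by field tag (e.g. '.T').

    Assumption: a new record starts at each '.I' line; field text may
    sit on the tag line itself or on the following lines.
    """
    record, tag = None, None
    for raw in lines:
        parts = raw.split(None, 1)
        head = parts[0] if parts else ""
        rest = parts[1].strip() if len(parts) > 1 else ""
        if head in tags:
            if head == ".I":
                if record is not None:
                    yield record
                record = {}
            if record is None:
                continue  # ignore stray fields before the first .I
            tag = head
            record[tag] = rest
        elif record is not None and tag is not None:
            text = raw.strip()
            if text:  # continuation of the current field
                record[tag] = (record[tag] + " " + text).strip()
    if record is not None:
        yield record


# Made-up sample data, not taken from the collection.
sample = """.I 1
.U
87000001
.T
A made-up title.
.W
First line of abstract
second line.
.I 2
.U
87000002
"""
records = list(parse_records(io.StringIO(sample)))
print(records[0][".T"])
```

Because the tag set is a parameter, the same function could be pointed at the queries file by passing the tags .I, .B, and .W. Records with no abstract simply lack the '.W' key.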

2) ohsumed.88 (78,585,929) -- Contains the MEDLINE documents for the year 1988, formatted as above.

3) ohsumed.89 (84,719,077) -- Contains the MEDLINE documents for the year 1989, formatted as above.

4) ohsumed.90 (86,754,890) -- Contains the MEDLINE documents for the year 1990, formatted as above.

5) ohsumed.91 (89,761,122) -- Contains the MEDLINE documents for the year 1991, formatted as above.

6) queries (11,591) -- Contains the 106 queries in the test set, with patient and topic information, in the format:
   .I   Sequential identifier
   .B   Patient information
   .W   Information request

7) drel.ui (26,919) -- Contains the query-document pairs rated as definitely relevant, with documents listed by MEDLINE UI, in the format:
   <query><tab><document-ui>

8) drel.i (21,709) -- Contains the query-document pairs rated as definitely relevant, with documents listed by sequential number (from the .I field), in the format:
   <query><tab><document-i>

9) pdrel.ui (57,831) -- Contains the query-doc pairs rated as definitely or possibly relevant, with documents listed by MEDLINE UI, in the format:
   <query><tab><document-ui>

10) pdrel.i (46,664) -- Contains the query-doc pairs rated as definitely or possibly relevant, with documents listed by sequential number (from the .I field), in the format:
   <query><tab><document-i>
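
The four relevance files (drel.ui, drel.i, pdrel.ui, and pdrel.i) share the same two-column layout, so one loader can serve all of them. The following Python sketch is a hedged example: the function name is hypothetical, it assumes the query column holds an integer query number, and the sample lines are made up.

```python
import io
from collections import defaultdict


def load_relevant(lines):
    """Map each query number to the set of document identifiers listed for it.

    Assumption: each non-blank line is '<query><tab><document>' and the
    query column is an integer. Document identifiers are kept as strings
    so the same loader works for both the .ui and the .i files.
    """
    relevant = defaultdict(set)
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        query, doc = line.split("\t")[:2]
        relevant[int(query)].add(doc.strip())
    return dict(relevant)


# Made-up sample lines, not taken from the collection.
sample = "1\t87000001\n1\t87000002\n2\t87000003\n"
qrels = load_relevant(io.StringIO(sample))
print(sorted(qrels[1]))
```

With the real files, one would pass an open file object instead of the StringIO, for example `load_relevant(open("drel.ui"))`.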

11) judged (368,366) -- Contains a list of all documents retrieved by any of the five original searchers or SMART, sorted first by query number and then by document number, along with their relevance judgments. The relevance judgments are either d (definitely relevant), p (possibly relevant), or n (not relevant). The relevance1 judgment is the original relevance judgment done on the documents retrieved by the original searchers. The relevance2 judgment is the second relevance judgment done to assess interobserver reliability of the relevance1 judgments. The relevance3 judgment is the relevance judgment done on documents retrieved by SMART but not the original searchers, or another relevance judgment on an originally retrieved document to assess interobserver reliability.
   <query><tab><document-ui><tab><document-i><tab>
   <relevance1>[<tab><relevance2>][<tab><relevance3>]
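
A line of the judged file can be split into its parts roughly as in the Python sketch below. Two assumptions are made that this file does not confirm: that each record sits on one physical line (the format above is shown on two lines only for width), and that a missing judgment in the middle is represented by an empty tab-separated field, so that positions are preserved. The function name and sample lines are made up.

```python
VALID_JUDGMENTS = {"d", "p", "n"}


def parse_judged_line(line):
    """Split one judged line into identifiers and up to three judgments.

    Assumption: fields are tab-separated on a single line, and an absent
    judgment is either an empty field or simply missing from the end.
    Absent judgments are returned as None.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least 4 tab-separated fields")
    judgments = [f.strip() or None for f in fields[3:6]]
    judgments += [None] * (3 - len(judgments))  # pad to three slots
    for j in judgments:
        if j is not None and j not in VALID_JUDGMENTS:
            raise ValueError("unexpected relevance code: %r" % j)
    return {
        "query": int(fields[0]),
        "ui": fields[1],
        "i": int(fields[2]),
        "relevance": tuple(judgments),
    }


# Made-up sample lines, not taken from the collection.
first = parse_judged_line("1\t87000001\t12\td\tp\n")
second = parse_judged_line("2\t87000002\t40\tn\t\td\n")
print(first["relevance"], second["relevance"])
```

If the real file turns out not to keep an empty field for a missing relevance2, a five-field line is ambiguous and the positional reading here would need to be revisited.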

12) ui (3,137,094) -- Contains the MEDLINE UIs for all 348,566 documents in the test database, listed one per line.

13) readme -- This file.

We realize that, due to the relative recall procedures used in building this collection, as well as the subjective nature of relevance judgments, there may be disagreements about the relevance judgments. I do want to be able to update the collection, but I want to do it in a systematic fashion, so that results among researchers will be comparable. Therefore I am asking that results be reported based on this collection unchanged. If you find new documents that you feel are relevant, or if you find documents for which you disagree with the relevance judgment, please notify me by email or in writing. Periodically, we will update the relevance judgments and release updated versions.

This work was made possible with support from NLM Grant LM05307. All opinions expressed and relevance judgments made, however, are the responsibility of William Hersh.

For more details, contact:

William Hersh, M.D.
Assistant Professor of Medicine and Medical Informatics
Oregon Health Sciences University
BICC
3181 SW Sam Jackson Park Rd.
Portland, OR   97201
Voice: 503-494-4563
Fax: 503-494-4551
Email: hersh@ohsu.edu

Bibliography:

1. Hersh WR, Hickam DH, Use of a multi-application computer workstation in a clinical setting, Bulletin of the Medical Library Association, 1994, 82: 382-389.

2. Hersh WR, Buckley C, Leone TJ, Hickam DH, OHSUMED: An interactive retrieval evaluation and new large test collection for research, Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.