TREC 2005 Genomics Track Protocol

William Hersh, Track Chair
Last updated - November 28, 2005

This page contains the protocol for the TREC 2005 Genomics Track.  As with the 2003 and 2004 tracks, there were two tasks.  Similar to 2004, one of the tasks consisted of ad hoc retrieval, while the second involved text categorization.  There is a companion data page for the 2005 track that includes instructions for accessing the data.

Official Runs

For the ad hoc task, 58 official runs were submitted by 32 participating research groups.  For the categorization tasks, 192 runs were submitted by 19 groups.  The results are reported in the track overview paper.

Ad Hoc Retrieval Task

In the 2005 ad hoc retrieval task, we employed topics that were more structured than the mostly free-form topics from the 2004 track.  The purpose of this approach was to provide systems with better defined (yet still realistic) queries for finding genomics information.  As such, we developed topics from generic structured templates so systems could make better use of other resources, such as ontologies or databases.  We also hoped this could serve as the basis to begin investigation toward an interactive task.


As with 2004, we collected topics from real biologists.  However, instead of soliciting free-form topics, we provided the biologists we interviewed with generic templates and asked them to express information needs they had recently that fit within those templates.

While it would have been ideal to interview users and develop the templates themselves from such interviews, the time frame of the track did not allow this.  Instead, we developed a set of generic topic templates (GTTs) derived from an analysis of the topics from the 2004 track and other known biologist information needs.  After we developed the GTTs, 11 people went out and interviewed 25 biologists to obtain instances of them.  We then had other people do some searching on the topics to make sure there was at least some, although not too much, information available about them.  The topics did not have to fit precisely into the GTTs, but had to come close (i.e., have all the required semantic types).

As with 2004, there were 50 topics.  We reached closure on 5 GTTs, each of which had 10 instances, for a total of 50 topics.  The five GTTs are listed below.  (We had an extra one in case one did not pan out during the interviews, which turned out to be the case.)  The semantic types in each GTT are underlined.  For some semantic types, more than one instance is allowed.  The five GTTs are:
  1. Find articles describing standard methods or protocols for doing some sort of experiment or procedure.
  2. Find articles describing the role of a gene involved in a given disease.
  3. Find articles describing the role of a gene in a specific biological process.
  4. Find articles describing interactions (e.g., promote, suppress, inhibit, etc.) between two or more genes in the function of an organ or in a disease.
  5. Find articles describing one or more mutations of a given gene and its biological impact.
In order to get participating groups started with the topics, and in order for them not to "spoil" the automatic status of their official runs by working with the official topics, we developed 10 sample topics, with two coming from each GTT.  Both the sample and official topics were available in three formats:  a Word file containing the GTTs and their instances (the topics) in tabular form, a PDF file that is a "printout" of the Word file, and a text file that has the topics expressed in narrative form (essentially the GTTs filled in with the instances).  Here are links to the sample topics:
All of the sample topics had sample searches and associated relevance judgments for the retrieved articles.  The files with these searches and judgments were posted in the active user portion of the track Web site.  The files were named 9X.txt, where 9X is the topic number.  Each file contains the PubMed search statement that generated the output, which was filtered for the time period of the MEDLINE subset (1994-2003).

In the even-numbered files (90, 92, 94, 96, 98), there is a tag DR, PR, or NR before each MEDLINE record that represents the relevance judgment.  The tag is right before the number of the document in the PubMed output, e.g., DR1 represents that the first record in the PubMed output for this particular search was definitely relevant.  In the odd-numbered files (91, 93, 95, 97, 99), the tag DR, PR, or NR that represents the relevance judgment is after each MEDLINE record.
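For those working with these sample files programmatically, here is a minimal sketch of pulling out the judgment tags.  It assumes only the DRn/PRn/NRn tag-line convention described above and ignores the surrounding MEDLINE record text entirely:

```python
import re

# Matches judgment tag lines such as "DR1", "PR2", "NR3": a judgment
# (definitely/possibly/not relevant) followed by the record's position
# in the PubMed output.  The surrounding record layout is assumed, not
# checked; this works for tags appearing before or after each record.
TAG_RE = re.compile(r"^(DR|PR|NR)(\d+)\s*$")

def extract_judgments(lines):
    """Return {record_number: judgment} for every tag line found."""
    judgments = {}
    for line in lines:
        m = TAG_RE.match(line.strip())
        if m:
            judgment, rank = m.group(1), int(m.group(2))
            judgments[rank] = judgment
    return judgments
```

Because the tag lines are the same in the even- and odd-numbered files (only their position relative to each record differs), a single pass like this handles both.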

Please note that neither the searches nor the judgments should be considered complete.  In addition, some of the retrieved MEDLINE records may not be in the MEDLINE subset, while some records that are in the test collection and should have been retrieved may be missing from the search output.  These searches and judgments are simply provided "as is" for those who want to see some for these topics.

The official topics are available on the active user portion of the track Web site.  A password is required to access the data and can ONLY be obtained from Lori Buckland of NIST (please do not email me to ask for it!).  The names of the files are:


The document collection for the 2005 ad hoc retrieval task was the same 10-year MEDLINE subset used for the 2004 track.  One goal we had was to produce a number of topic and relevance judgment collections that use this same document collection, to make retrieval experimentation easier (so people do not have to load different collections into their systems).  More uses of this subset will be forthcoming later.

More detail about the document collection is available on the 2004 protocol page and 2004 data page, although highly pertinent information is reproduced here.  The document collection for the task is a 10-year subset of the MEDLINE bibliographic database of the biomedical literature.  MEDLINE can be searched by anyone in the world using the PubMed system of the National Library of Medicine (NLM), which maintains both MEDLINE and PubMed.  The full MEDLINE database contains over 13 million references dating back to 1966 and is updated on a daily basis.

The subset of MEDLINE for the TREC 2005 Genomics Track consists of 10 years of completed citations from the database inclusive from 1994 to 2003.  Records were extracted using the Date Completed (DCOM) field for all references in the range of 19940101 - 20031231.  This provided a total of 4,591,008 records, which is about one third of the full MEDLINE database.  A gzipped list of the PubMed IDs (PMIDs) is available (10.6 megabytes compressed).

The data include all of the PubMed fields identified in the MEDLINE Baseline record; only the PubMed-centric tags are removed from the XML version.  A description of the various fields of MEDLINE is available at:

It should also be noted that not all MEDLINE records have abstracts, usually because the article itself does not have an abstract.  In general, about 75% of MEDLINE records have abstracts.  In our subset, there are 1,209,243 (26.3%) records without abstracts.

The MEDLINE subset is available in the "MEDLINE" format, which consists of ASCII text with fields indicated and delimited by 2-4 character abbreviations.  The size of the file uncompressed is 9,587,370,116 bytes, while the gzipped version is 2,797,589,659 bytes.  The file can be found with the 2004 files on the data portion of the Web site.  An XML version of MEDLINE subset is also available; see the 2004 data page for details.
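For groups writing their own readers, here is a hedged sketch of parsing the "MEDLINE" format.  It assumes the standard display layout of "TAG - value" lines with indented continuation lines and blank-line record separators, which is not verified against the track files:

```python
def parse_medline(text):
    """Parse MEDLINE-format text into a list of {field: value} dicts.

    Sketch only: assumes 'TAG - value' lines, continuation lines
    indented with spaces, and blank lines between records.  Repeated
    tags (e.g. AU) are joined with '; '.
    """
    records, current, last_tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            if current:                       # blank line ends a record
                records.append(current)
                current, last_tag = {}, None
            continue
        if line[:1] == " " and last_tag:      # indented continuation line
            current[last_tag] += " " + line.strip()
        else:
            tag, _, value = line.partition("-")
            tag, value = tag.strip(), value.strip()
            if tag in current:                # repeated field
                current[tag] += "; " + value
            else:
                current[tag] = value
            last_tag = tag
    if current:
        records.append(current)
    return records
```

Given the 9+ gigabyte uncompressed file size, a production reader should stream records rather than load the whole file, but the per-record logic is the same.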
Groups contemplating using their own MEDLINE collection or filtering from "live" PubMed should be aware of the following caveats (from Jim Mork of NLM):

Relevance Judgments

Relevance judgments were done similarly to the TREC 2004 Genomics Track and other TREC tracks using the "pooling" method, where the top-ranking documents from each group's best run were pooled and given to a judge with expertise in biology.  The relevance judges were instructed in the following manner for each GTT:
  1. Relevant article must describe how to conduct, adjust, or improve a standard or new method or protocol for doing some sort of experiment or procedure.
  2. Relevant article must describe some specific role of the gene in the stated disease.
  3. Relevant article must describe some specific role of the gene in the stated biological process.
  4. Relevant article must describe a specific interaction (e.g., promote, suppress, inhibit, etc.) between two or more genes in the stated function of the organ or the disease.
  5. Relevant article must describe a mutation of the stated gene and the particular biological impact(s) that the mutation has been found to have.
In general, the articles had to describe a specific gene, disease, impact, mutation, etc. and not the concept generally.

Submitted Runs

We collected other data about submitted runs besides the system output.  One item was the run type, which fell into one of (at least) three categories:
Recall and precision for the ad hoc retrieval task were calculated in the classic IR way, using the preferred TREC statistic of mean average precision (MAP, the average of the precision values at each point a relevant document is retrieved).  This was done using the trec_eval program.  The code for trec_eval is available at
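The MAP statistic can be sketched in a few lines (trec_eval remains the official scorer; this is only the definition in code, with average precision divided by the number of relevant documents in the qrels as trec_eval does):

```python
def average_precision(ranked_pmids, relevant):
    """Average precision for one topic: the mean of the precision
    values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, pmid in enumerate(ranked_pmids, start=1):
        if pmid in relevant:
            hits += 1
            total += hits / rank          # precision at this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP over topics.  runs maps topic -> ranked PMID list;
    qrels maps topic -> set of relevant PMIDs."""
    aps = [average_precision(runs.get(t, []), rel) for t, rel in qrels.items()]
    return sum(aps) / len(aps) if aps else 0.0
```

Note that topics with no retrieved relevant documents contribute an average precision of zero, which is also how the official scoring treats them.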

The trec_eval program requires two files for input.  One file is the topic-document output, sorted by each topic and then subsorted by the order of the IR system output for a given topic.  This format is required for official runs submitted to NIST to obtain official scoring.

The topic-document output should be formatted as follows:
100 Q0 12474524 1 5567     tag1
100 Q0 12513833 2 5543    tag1
100 Q0 12517948 3 5000    tag1
101 Q0 12531694 4 2743    tag1
101 Q0 12545156 5 1456    tag1
102 Q0 12101238 1 3.0    tag1
102 Q0 12527917 2 2.7    tag1
103 Q0 11731410 1 .004     tag1
103 Q0 11861293 2 .0003    tag1
103 Q0 11861295 3 .0000001 tag1
The second file required for trec_eval is the relevance judgments, which are called "qrels" in TREC jargon.  More information about qrels can be found at .  The qrels file is in the following format:
100    0    12474524    1
101    0    12513833    1
101    0    12517948    1
101    0    12531694    1
101    0    12545156    1
102    0    12101238    1
102    0    12527917    1
103    0    11731410    1
103    0    11861293    1
103    0    11861295    1
103    0    12080468    1
103    0    12091359    1
103    0    12127395    1
103    0    12203785    1
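A sketch of reading a qrels file in this format into per-topic sets of relevant PMIDs (assuming the four whitespace-separated columns shown above, with the PMID in the third column and the judgment in the fourth; any nonzero judgment is treated as relevant):

```python
from collections import defaultdict

def read_qrels(lines):
    """Parse qrels lines ('topic 0 pmid judgment') into
    {topic: set of relevant PMIDs}.  Judgment-0 lines are skipped."""
    qrels = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue                     # skip blank or malformed lines
        topic, _, pmid, judgment = parts
        if judgment != "0":
            qrels[topic].add(pmid)
    return dict(qrels)
```

The output of this reader plugs directly into a MAP calculation, or the file can simply be handed to trec_eval as-is for official scoring.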

Categorization Task

The second task for the 2005 track was a categorization task.  It was similar in part to the 2004 categorization task in that it used data from the Mouse Genome Informatics (MGI) system and was a document triage task.  It included another running of one subtask from 2004, the triage of articles for GO annotation, and added triage of articles for three other major types of information collected and catalogued by MGI.  These included articles about:
  1. Tumor biology
  2. Embryologic gene expression
  3. Alleles of mutant phenotypes
As such, the categorization task looked at how well systems could categorize documents for four categories (the three listed plus GO annotation).  We used the same utility measure as last year but with different parameters (see below).  We created an updated version of the cat_eval program that calculates the utility measure plus recall, precision, and the F score.  We calculated utility for each of the four categorization subtasks separately.
For more information about the MGI system and the components from which documents were triaged, consult the following references (not all of these are freely available on the Web):


The documents for the 2005 categorization task consisted of the same full-text articles used in 2004.  The articles came from three journals over two years, reflecting the full-text data we were able to obtain from Highwire Press: Journal of Biological Chemistry (JBC), Journal of Cell Biology (JCB), and Proceedings of the National Academy of Sciences (PNAS).  These journals have a good proportion of mouse genome articles.  Each of the papers from these journals is in SGML format.  Highwire's DTD and its documentation are available.  Also the same as 2004, we designated 2002 articles as training data and 2003 articles as test data.  The documents for the tasks came from the subset of these articles that contained the words "mouse" or "mice" or "mus" as described in the 2004 protocol.  A crosswalk or look-up table was provided that matches an identifier for each Highwire article (its file name) and its corresponding PubMed ID (PMID).  The table below shows the total number of articles and the number in the subset the track used.

Journal   2002 papers (total, subset)   2003 papers (total, subset)   All papers (total, subset)
JBC       6566, 4199                    6593, 4282                    13159, 8481
JCB       530, 256                      715, 359                      1245, 615
PNAS      3041, 1382                    2888, 1402                    5929, 2784
Total     10137, 5837                   10196, 6043                   20333, 11880

The following table lists the files containing the documents and related data.  The files can be found on the active user portion of the Web site.

File contents                                               Training data file name   Test data file name
Full-text document collection in Highwire SGML
format (in 2004 data directory)                             train.tar.Z               test.tar.Z
Crosswalk files of PMID, Highwire file name of article,
journal code, and year of paper publication
(in 2005 data directory)                                    train.crosswalk.txt       test.crosswalk.txt
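A hedged sketch of loading a crosswalk file into a Highwire-filename-to-PMID lookup.  The field order (PMID first, then the Highwire file name, then journal code and year) is assumed from the description above and should be verified against the actual files; the file name in the test is illustrative only:

```python
def read_crosswalk(lines):
    """Parse crosswalk lines into {highwire_filename: pmid}.

    Assumes whitespace- or tab-separated fields with the PMID in the
    first column and the Highwire file name in the second; extra
    columns (journal code, year) are ignored.
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            pmid, filename = parts[0], parts[1]
            mapping[filename] = pmid
    return mapping
```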

The SGML training document collection is 150 megabytes in size compressed and 449 megabytes uncompressed.  The SGML test document collection is 140 megabytes compressed and 397 megabytes uncompressed.  Many gene names have Greek or other non-English characters, which can present a problem for those attempting to recognize gene names in the text.  The Highwire SGML appears to obey the rules posted on the NLM Web site with regard to these characters (

Evaluation measures

We again used the utility measure as the primary evaluation measure, but in a slightly different way in 2005.  This was because there are varying numbers of positive examples across the four categorization subtasks.

The framework for evaluation in the categorization task was based on the following table of possibilities:

                 Relevant (classified)    Not relevant (not classified)
Retrieved        True positive (TP)       False positive (FP)              All retrieved (AR)
Not retrieved    False negative (FN)      True negative (TN)               All not retrieved (ANR)
                 All positive (AP)        All negative (AN)

The measure for evaluation was the utility measure often applied in text categorization research and used by the former TREC Filtering Track.  This measure contains coefficients for the utility of retrieving a relevant and retrieving a nonrelevant document.  We used a version that was normalized by the best possible score:
Unorm = Uraw / Umax

For a test collection of documents to categorize, Uraw is calculated as follows:
Uraw = (ur * TP) + (unr * FP)
For our purposes, we assume that unr = -1 and solve for ur using MGI's current practice of triaging everything:
0.0 = ur*AP - AN
ur = AN/AP
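The derivation above can be expressed directly in code (a sketch of the formulas, not the official cat_eval program):

```python
def relevant_weight(all_positive, all_negative):
    """ur = AN / AP: the weight at which MGI's 'triage everything'
    baseline (TP = AP, FP = AN) gets a raw utility of exactly zero,
    given unr = -1."""
    return all_negative / all_positive

def normalized_utility(tp, fp, all_positive, ur):
    """Unorm = Uraw / Umax, with Uraw = ur*TP - FP and
    Umax = ur*AP (the best possible score: all positives, no FPs)."""
    return (ur * tp - fp) / (ur * all_positive)
```

For example, relevant_weight(338, 5499) reproduces the 16.27 value for the allele training collection in the table below.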

AP and AN were different for each task, as shown in the following table:

Task               Collection   AP    AN     Total   ur = AN/AP
A (allele)         train        338   5499   5837    16.27
A (allele)         test         332   5711   6043    17.20
E (expression)     train        81    5756   5837    71.06
E (expression)     test         105   5938   6043    56.55
G (GO annotation)  train        462   5375   5837    11.63
G (GO annotation)  test         518   5525   6043    10.67
T (tumor)          train        36    5801   5837    161.14
T (tumor)          test         20    6023   6043    301.15

(Yes, the numbers for GO annotation were different from the 2004 data.  This is because additional articles were triaged by MGI since we collected the data last year.)

The ur's for A and G are fairly close across the training and test collections, while the ur's for E and especially T vary much more.  We therefore established a ur that was the average of those computed for the training and test collections, rounded to the nearest whole number.  That resulted in this set of ur's for each task:

A (allele) 17
E (expression) 64
G (GO annotation) 11
T (tumor) 231

In order to facilitate calculation of the modified version of the utility measure for the 2005 track, we updated the cat_eval program to version 2.0, which included a command-line parameter to set ur.  Here is documentation for the program:

The program has been tested on Windows and Solaris but should run on just about any OS where you can install Python.  Similar to trec_eval, there is very little error-checking for proper data format, so make sure your data files are in the format specified on the track protocol page.  If you have any issues, please email Aaron Cohen at

For the Python script version, you need Python 2.3 or above installed.  Install as you would any script.
For the Windows executable version, unzip to a directory and execute cat_eval2.exe; no Python installation is necessary.

Usage: data_file gold_standard_file Urelevant [-tab]
The required Urelevant parameter needs to be set to the proper weight for each of the sub-tasks.
The optional -tab argument makes the script output results in tab-delimited format instead of the console-readable format.

There is one sample output file for each sub-task in the active user portion of the track Web site:
The cat_eval program expects a separate file for each run of each task, where the file has three tab-separated fields:
triageG    12189157    sample
triageG    12189154    sample
triageG    12393878    sample
triageG    12451176    sample
triageG    12209011    sample
triageG    12209014    sample
Here is sample output you should get from the sample Expression sub-task: sample.E+train.txt triage.E+train.txt 64.0
Run: sample
Counts: tp=81; fp=2538; fn=0
Precision: 0.0309
Recall: 1.0000
F-score: 0.0600
Utility Factor: 64.00
Raw Utility: 2646
Max Utility: 5184
Normalized Utility: 0.5104
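The sample output above can be reproduced from the raw counts with a sketch of the cat_eval formulas.  This is not the official program; Umax here assumes the best case of retrieving all AP = TP + FN positive documents with no false positives:

```python
def score_run(tp, fp, fn, ur):
    """Compute the cat_eval summary metrics from raw counts:
    precision, recall, F-score, and raw/max/normalized utility."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    u_raw = ur * tp - fp                  # unr = -1 per the protocol
    u_max = ur * (tp + fn)                # best case: all AP, no FPs
    return {"precision": precision, "recall": recall, "f": f_score,
            "u_raw": u_raw, "u_max": u_max, "u_norm": u_raw / u_max}
```

Plugging in the counts from the Expression sample (tp=81, fp=2538, fn=0, ur=64.0) yields the precision, recall, F-score, and utility values shown above.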
The official results calculated Unorm for each of the four categories as well as an overall mean of the four.

Participants were strongly encouraged to adhere to a naming convention for their run tags that has the first letter of the tag designating the specific run:  a for allele, e for expression, g for GO, and t for tumor.


The training data came in four files, one for each category (i.e., A, E, G, and T).  (The fact that three of these four correspond to the four nucleotides in DNA is purely coincidental!)  They were named as Atrain.txt, Etrain.txt, etc. and are available in the active user portion of the track Web site.  The test data obey the same naming conventions.

What Resources Can Be Used?

A common question was, what resources can legitimately be used to aid in categorizing the documents?  In general, we allowed use of anything, including resources on the MGI Web site.  The only resource participants could not use was the direct data itself, i.e., data directly linked to the PMID or the associated MGI unique identifier.  Thus, they could not go into the MGI database (or any other aggregated resource such as Entrez Gene or SOURCE) and pull out GO codes, tumor terms, mutant phenotypes, or any other data explicitly linked to a document.  But anything else was fair game.

We also made available a cheatsheet developed by MGI for its curators who triage documents.  This version of the sheet was a couple years old, but given that the articles and data we are using were also that old, this sheet may actually have been more appropriate than an up-to-date one would have been.  As with the sample searches in the ad hoc task, this was provided on an "as is" basis, with no guarantees it would be helpful.  The appropriate "Areas" on the sheet that correlate to the categories we were triaging to are:

Automated tagging of mouse genes in MEDLINE corpus

Aaron Cohen of OHSU processed the entire TREC 2004 10-year MEDLINE subset with the mouse named-entity recognizer and normalizer (NER+N) that we presented at BioLINK 2005.  There are two files in the TREC 2004 genomics data directory that correspond to the two halves of the split 10-year subset:
(Note that these files are from the MEDLINE corpus for the ad hoc task, but all of the full-text Highwire documents are in this corpus and can be linked using the crosswalk files.)

The files are in tab-separated format, one line for each PMID. The first field is the PMID of the MEDLINE record, followed by a variable number of fields which are the MGI identifiers for the genes found by our system in that record. Only MEDLINE records found to contain one or more mouse genes are included. The files are about 18 megabytes each.
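A sketch of loading these files into a PMID-to-genes lookup, assuming exactly the tab-separated layout described (PMID first, then a variable number of MGI identifiers; the identifiers in the test below are illustrative only):

```python
def read_gene_annotations(lines):
    """Parse the tagged-gene files: each line holds a PMID followed by
    the MGI identifiers of the mouse genes found in that record.
    Returns {pmid: [mgi_ids]}."""
    genes = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            genes[fields[0]] = fields[1:]
    return genes
```

Combined with the crosswalk files mentioned above, this mapping can be joined to the full-text Highwire documents for the categorization task.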

This could be a useful addition to the corpus.  However, the files were automatically generated and have not been manually reviewed, so some errors are to be expected.  In his BioLink paper, Aaron reported a precision of 0.775 at a recall of 0.726 for mouse gene NER+N on the BioCreative test collection, which is competitive with the state of the art.