TREC 2003 Genomics Track - Protocol
William Hersh, Track Chair
Oregon Health & Science University
hersh@ohsu.edu
Last updated - November 30, 2008
This document is the final official protocol for the TREC 2003
Genomics Track. The track consisted of a primary and a secondary
task. The primary task was a conventional ad hoc retrieval task,
with success measured by recall-precision. The secondary task
explored information extraction from article abstracts. (NOTE: The
LocusLink database used in this track has since been superseded by
EntrezGene.)
Some important URLs
Text Retrieval Conference (TREC) - http://trec.nist.gov/
TREC Genomics Track - http://ir.ohsu.edu/genomics/
Getting started
The following underlined text shows the names of the links that
provide access to various parts of the data from the protected area
of this Web site:
- Genomics Track Homepage links
to this site.
- Primary Task Documents links to a page that provides
instructions on what must be done to obtain the document
collection. You must complete the data use form and send it to
NIST,
after which you will receive a password to access the document
collection from this page.
- Primary Task - Training Relevance Judgments (qrels) links
to the relevance judgments for the training data. This file
consists of GeneRIFs from LocusLink for the 50 training topics.
- Primary Task - Training Topics links to the training
topics. This file consists of gene names from LocusLink.
- Primary Task - Test Topics links to the official test
topics. This file also consists of gene names from LocusLink.
- Primary Task - Test Relevance Judgments (qrels) links to
the relevance judgments for the test data. This file consists of
GeneRIFs from LocusLink for the 50 test topics.
- Secondary Task Data links to the data for the secondary
task.
- Secondary Task Optional Full-Text Documents links to the
full-text of the 139 articles referenced by the GeneRIFs in the
secondary task. (The use of these is optional but they are there
if you want them.)
- Secondary Task Perl code is a WinZip file containing the
code and sample files for calculating results for the Dice
coefficient and its derivative measures.
General principles
A number of general principles have emerged concerning the 2003 track
from the various workshops and other discussions held so far:
- We want to use document and topic sets large enough in scope to
be meaningful from an IR standpoint.
- We want the first year to be constrained enough
so as not to require a great deal of resources on the part of
NIST or the track.
- Our initial task will focus on document/text retrieval.
- Our model system user is a biology researcher: either an
established researcher seeking information in a new area or a
graduate student wanting to find information about a topic.
- We will try to develop tasks/topics that reflect the real
information needs of biology researchers.
- If possible, we will try to make use of existing resources for
relevance judgments in the first year, given our constraints.
- Our first year’s work will also include identifying or applying
for resources that will enable us to develop the experimental milieu
that we need to carry out meaningful research.
Most of our decisions for the 2003 track emanate from the choice of
resource we plan to use for relevance judgments, which is the Gene
References into Function (GeneRIF) field of
LocusLink.
GeneRIFs are relatively new, but are thought by consensus to be
the most consistent and comprehensive resource for relevance judgments
available short of developing our own. Their major limitation is
that they have only been systematically applied by NLM indexers since
around 4/1/2002. There are other possible options, such as the
literature annotations in some of the other model organisms, in
particular mouse and yeast, but we have decided to go with GeneRIFs for
now.
Primary task
As noted above, the primary task will consist
of ad hoc document retrieval. This type of task requires
a document collection, topics, and relevance judgments. For
those wanting an overview of the general approach of TREC, we
suggest the overview from the TREC 2002 conference.
Documents
The document collection consists of 525,938 MEDLINE records whose
indexing was completed between 4/1/2002 and 4/1/2003. The
data are available on the TREC Web site, which is password-protected
and limited to only those who have signed up
to participate in TREC. The MEDLINE records are in a single
compressed
text file. (A 10,000-record subset is also available to those
wanting to get started with a smaller collection of data.) The
following files will be available:
- trec-medline.gz - full collection of 525,938 MEDLINE records for
track
- trec-medline10k.gz - 10,000-record subset of MEDLINE
The data format is the standard NLM MEDLINE format, not XML. In
the NLM format, the delimiter between records is a blank line, and
fields are indicated by their 2-3 letter abbreviation. The fields
likely to be most important to
track participants are: PubMed Unique Identifier (PMID),
title (TI), abstract (AB), and MeSH headings (MH). A description
of all the fields in a MEDLINE record can be found in the PubMed help
file at:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#MEDLINEDisplayFormat
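To illustrate, here is a minimal Python sketch of one way to read the
NLM-format file, following the field layout described in the PubMed
help file cited above (four-character field tags followed by "- ",
continuation lines beginning with whitespace); it is a sketch only,
not an official loader:

import gzip

def parse_medline(path):
    # Iterate over records: a blank line ends a record, a field tag
    # occupies the first four columns followed by "- ", and wrapped
    # field values continue on lines that begin with whitespace.
    record, tag = {}, None
    with gzip.open(path, "rt") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if not line.strip():                 # blank line: end of record
                if record:
                    yield record
                record, tag = {}, None
            elif line[:1] in (" ", "\t") and tag:
                record[tag] += " " + line.strip()  # continuation line
            else:
                tag, value = line[:4].strip(), line[6:]
                if tag in record:                # repeatable field, e.g. MH
                    record[tag] += "; " + value
                else:
                    record[tag] = value
    if record:                                   # last record, no trailing blank
        yield record

for rec in parse_medline("trec-medline.gz"):
    print(rec.get("PMID"), rec.get("TI", "")[:60])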
Participating groups must load the data into an IR system to perform
the experiments. This will likely involve reformatting the data to
meet the input format of the system they choose to use. Although
MEDLINE data are publicly available on the Internet, participants
must remember that MEDLINE records from this collection must not be
posted or otherwise made available on a public Web site, since
this subset is an incomplete version of the MEDLINE database. One
system commonly used for IR experimentation that you may want to
consider is SMART, which is available at
ftp://ftp.cs.cornell.edu/pub/smart/.
Topics
The topics consist of gene names, with the specific task being as
follows:
- For gene X, find all MEDLINE references that focus on the basic
biology of the gene or its protein products from the designated
organism. Basic biology includes isolation, structure, genetics
and function of genes/proteins in normal and disease states.
We are distributing training and test topic sets of 50 genes
each. The training data are provided for groups to get an idea of
what the data in this year's track are like. The test data are
the topics for the official runs in the track. For each set of 50
topics, we have randomly chosen gene names that are distributed across
the spectrum of organisms, the number of GeneRIFs (many to few), name
types (see below), and whether or not the gene names are Medical
Subject Heading (MeSH) indexing terms. The training data include
a file of GeneRIFs that comprise the relevance judgments for these
topics. (See below for more details.)
Gene names
The file of gene names has the following format:
3 2120 Homo sapiens OFFICIAL_GENE_NAME ets variant gene 6 (TEL oncogene)
3 2120 Homo sapiens OFFICIAL_SYMBOL ETV6
3 2120 Homo sapiens ALIAS_SYMBOL TEL
3 2120 Homo sapiens PREFERRED_PRODUCT ets variant gene 6
3 2120 Homo sapiens PRODUCT ets variant gene 6
3 2120 Homo sapiens ALIAS_PROT TEL1 oncogene
4 2252 Homo sapiens OFFICIAL_GENE_NAME fibroblast growth factor 7 (keratinocyte growth factor)
4 2252 Homo sapiens OFFICIAL_SYMBOL FGF7
4 2252 Homo sapiens ALIAS_SYMBOL KGF
4 2252 Homo sapiens ALIAS_SYMBOL HBGF-7
4 2252 Homo sapiens PREFERRED_PRODUCT fibroblast growth factor 7 precursor
4 2252 Homo sapiens PRODUCT fibroblast growth factor 7 precursor
4 2252 Homo sapiens ALIAS_PROT keratinocyte growth factor
where:
- The first column is the official topic number (1-50).
- The second column contains the LocusLink ID for
the gene.
- The third column contains the name of the organism.
- The fourth column contains the gene name type.
- The fifth column contains the gene name.
There are four possible organisms:
- Homo sapiens - human
- Mus musculus - mouse
- Rattus norvegicus - rat
- Drosophila melanogaster - fruit fly
There are five possible name types:
- OFFICIAL_GENE_NAME
- PREFERRED_GENE_NAME
- OFFICIAL_SYMBOL
- PREFERRED_SYMBOL
- PREFERRED_PRODUCT
The rules for naming (from the LocusLink help file) are as follows:
[OFFICIAL|PREFERRED]_GENE_NAME:
[alphanumeric] [unique] [required (but may be null)]
the gene description used for gene reports
OFFICIAL: validated by the appropriate nomenclature committee
PREFERRED: interim selected for display
[NOTES--If the symbol is official, the gene_name will be official.
No record will have both official AND interim nomenclature.]
[OFFICIAL|PREFERRED]_SYMBOL:
[alphanumeric] [unique] [required]
the symbol used for gene reports
OFFICIAL: validated by the appropriate nomenclature committee
PREFERRED: interim option selected for display
na is used for models without evidence
PREFERRED_PRODUCT:
[alphanumeric] [unique] [optional]
the name of the product used in the RefSeq record
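Putting the pieces above together, a Python sketch of reading the
gene-name file into per-topic queries might look as follows (the
tab-separated columns and the file name "topics.txt" are assumptions
for illustration, not part of the official distribution):

from collections import defaultdict

# Collect every name variant for each of the 50 topics.
names_by_topic = defaultdict(list)
with open("topics.txt") as f:                   # hypothetical file name
    for line in f:
        topic, locus_id, organism, name_type, name = \
            line.rstrip("\n").split("\t")
        names_by_topic[int(topic)].append(name)

# One simple query formulation: the union of all name variants,
# e.g. "ets variant gene 6 (TEL oncogene) ETV6 TEL ..." for topic 3.
query = " ".join(names_by_topic[3])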
Relevance Judgments
Due to resource constraints, the relevance judgments for the 2003 track
consist of GeneRIFs from the NLM's LocusLink database. Track
participants are not allowed to use GeneRIF data to augment their
queries. While we recognize that GeneRIFs are, like the rest of
LocusLink, publicly available, we will be working on the honor
system of not using GeneRIF data.
We will calculate recall and precision in the
classic IR way, using the preferred TREC statistic of mean average
precision (MAP): the mean, across topics, of the average of the
precision values at each point a relevant document is
retrieved. This will be done in the standard
TREC way of participants submitting their results in the format for
input to Chris Buckley’s trec_eval program. The code for
trec_eval is available at
ftp://ftp.cs.cornell.edu/pub/smart/ . There are several
versions of trec_eval, which differ mainly in the statistics they
calculate in their output. We recommend the following version of
trec_eval, which should compile with any C compiler:
ftp://ftp.cs.cornell.edu/pub/smart/trec_eval.v3beta.shar.
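As a worked example of the statistic: if a topic has four relevant
documents in the qrels and a run retrieves three of them at ranks 1,
3, and 10, its average precision is (1/1 + 2/3 + 3/10) / 4, or about
0.49, and MAP is the mean of this value over all 50 topics.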
Please note a few "quirks" about trec_eval:
- It uses some Unix-specific system calls and so would require
considerable modification to run on another platform.
- The aggregate statistics that are presented at the end of the
file (averages for precision at points of recall, average precision,
etc.) only include queries for which one or more documents were
retrieved. Therefore, you should insert a "dummy" document in
your output for queries that retrieve no real documents so that your
aggregate
scores are averaged over all 50 documents.
The trec_eval program requires two files for input. One file is
the topic-document output, sorted by each topic and then subsorted by
the order of the IR system output for a given topic. This
format is required for official runs submitted to NIST to obtain
official scoring. The topic-document output should be formatted by
your IR system as follows (a sketch of producing this format in code
appears after the column descriptions below):
1 Q0 12474524 1 5567 tag1
1 Q0 12513833 2 5543 tag1
1 Q0 12517948 3 5000 tag1
1 Q0 12531694 4 2743 tag1
1 Q0 12545156 5 1456 tag1
2 Q0 12101238 1 3.0 tag1
2 Q0 12527917 2 2.7 tag1
3 Q0 11731410 1 .004 tag1
3 Q0 11861293 2 .0003 tag1
3 Q0 11861295 3 .0000001 tag1
where:
- The first column is the topic number (1-50).
- The second column is the query number within that topic. This is
currently unused and must always be Q0.
- The third column is the official PubMedID of the retrieved
document.
- The fourth column is the rank at which the document was retrieved.
- The fifth column shows the score (integer or floating point) that
generated the ranking. This score MUST be in descending
(non-increasing) order. The trec_eval program ranks documents
based on the scores, not the ranks in column 4. If a submitter
wants the exact ranking submitted to be evaluated, then the SCORES must
reflect that ranking.
- The sixth column is called the "run tag" and must be a unique
identifier across all runs submitted to TREC. Thus, each run tag should
have a part that identifies the group and
a part that distinguishes runs from that group. Tags are
restricted to 12 or fewer letters and numbers, and *NO* punctuation, to
facilitate labeling graphs and such with the tags.
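For illustration, the following Python sketch writes a run file in
this format, including the dummy-document workaround noted above; the
"results" mapping, the output file name, and the run tag "myrun1" are
all hypothetical:

# "results" is assumed to map topic numbers to (pmid, score) lists
# already sorted by descending score.
results = {1: [("12474524", 5567), ("12513833", 5543)]}  # toy example

with open("myrun.txt", "w") as out:
    for topic in range(1, 51):
        hits = results.get(topic) or [("0", 0)]  # dummy document so the
                                                 # topic counts in averages
        for rank, (pmid, score) in enumerate(hits, start=1):
            out.write(f"{topic} Q0 {pmid} {rank} {score} myrun1\n")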
The second file required for trec_eval is the relevance judgments,
which are called "qrels" in TREC jargon. More
information about qrels can be found at
http://trec.nist.gov/data/qrels_eng/ . The qrels file is in
the following format:
1 0 12474524 1
1 0 12513833 1
1 0 12517948 1
1 0 12531694 1
1 0 12545156 1
2 0 12101238 1
2 0 12527917 1
3 0 11731410 1
3 0 11861293 1
3 0 11861295 1
3 0 12080468 1
3 0 12091359 1
3 0 12127395 1
3 0 12203785 1
where:
- The first column is the topic number (1-50).
- The second column is always 0.
- The third column is the PubMedID of the document.
- The fourth column is always 1.
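For local sanity checks before submitting, the core MAP computation
can be approximated in a few lines of Python (a sketch only; official
scores come from trec_eval, which ranks by score rather than by file
order, and the file names here are hypothetical):

from collections import defaultdict

relevant = defaultdict(set)
for line in open("qrels.txt"):
    topic, _, pmid, _ = line.split()
    relevant[topic].add(pmid)

ranked = defaultdict(list)
for line in open("myrun.txt"):
    topic, _q0, pmid, _rank, _score, _tag = line.split()
    ranked[topic].append(pmid)

aps = []
for topic, rels in relevant.items():
    found, ap = 0, 0.0
    for i, pmid in enumerate(ranked[topic], start=1):
        if pmid in rels:
            found += 1
            ap += found / i          # precision at each relevant hit
    aps.append(ap / len(rels))       # divide by total relevant in qrels

print("MAP:", sum(aps) / len(aps))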
Training data issues
The group from Oregon & Health Science University performed an
analysis of relevance for 10 queries
from the training data. This preliminary work validates our basic
assumption that an article pointed to by a GeneRIF is likely to be
relevant in the classic IR sense but here are many “false negatives”
(i.e., articles that are relevant but do not have a GeneRIF
designation). We plan to carry out a similar analysis for all of
the queries in the test data after all results have been submitted.
We also discovered some "quirks" with the training data qrels:
- A number of qrels represent documents not present in the document
collection.
- Three topics have no qrels in the document collection: 21,
35, and 49.
We made sure these problems did not occur with the test data. For
the test topics, we also used only gene names that had a minimum of
three qrels in the collection.
Secondary task
There is much interest in the bioinformatics community in information
extraction. This comes in part from the desire to allow
scientists to learn about new topics as quickly as they can, preferably
without having to read and synthesize many papers. So we
will begin explorations of this as well. The particular task
in 2003 will be to reproduce the GeneRIF annotation. This task is
more exploratory in nature and has no "gold standard"; instead,
groups will attempt the task and compare their methods and
results. One possibility is to calculate some sort of overlap
measure between words that research groups nominate for annotation with
those actually selected by NLM. A problem, however, is that while
some GeneRIF snippets are direct quotations from article abstracts,
others are paraphrased. Furthermore, there can be other
legitimate references to basic gene biology beyond the official GeneRIF
snippet. An analysis by Jim Mork and Alan Aronson of NLM found
that 95% of GeneRIF snippets contained some text from the title or
abstract of the article. About 42% of the matches were direct
"cut and paste" from the title or abstract, and another 25% contained
significant runs of words from the title or abstract.
The goal of the secondary task is to reproduce the GeneRIF from
the MEDLINE record. Because of the exploratory nature of this
task,
we have not provided any training data. Groups are asked to use
automated approaches and describe them frankly in their reports.
Likewise, we do not have a single performance measure, but rather
a suite of them.
Secondary task data
The data for the secondary task consist of 139 GeneRIFs representing
all of the articles appearing in five journals (Journal of Biological
Chemistry, Journal of Cell Biology, Nucleic Acids Research,
Proceedings of the National Academy of Sciences, and Science) during
the latter half of 2002. The GeneRIFs are formatted as follows:
1 355 12107169 J Biol Chem 2002 Sep 13;277(37):34343-8. the death effector domain of FADD is involved in interaction with Fas.
2 355 12177303 Nucleic Acids Res 2002 Aug 15;30(16):3609-14. In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis
where:
- The first column is the TREC ID
- The second column is the LocusLink ID for the gene
- The third column is the PubMed ID
- The fourth column is the source of the article
- The fifth column is the GeneRIF text
We also have the full text of these articles from Highwire Press,
which obtained permission from the publishers for us to use
them. These are available on the protected portion of the TREC Web
site.
Secondary Task Performance Measures
The original plan for assessing the secondary task was to use the Dice
coefficient, which measures overlap of two strings. That is, the
overlap between the candidate GeneRIF and actual GeneRIF would be
calculated.
For two strings A and B, define X as the number of words in A, Y
as the number of words in B, and Z as the number of words occurring in
both A and B. The Dice coefficient is measured by:
Dice (A, B) = (2 * Z)/(X + Y)
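In Python, a sketch of this basic measure, treating each string as a
set of lower-cased words (an illustration of the definition above,
not the official scoring code):

# e.g. dice("death effector domain", "effector domain of FADD")
#        = 2 * 2 / (3 + 4) = 4/7, about 0.57
def dice(a, b):
    x, y = set(a.lower().split()), set(b.lower().split())
    return 2 * len(x & y) / (len(x) + len(y))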
It quickly became apparent that this measure was quite limited.
It did not, for example, perform any “normalization” of words, such as
stop word removal or stemming. It also did not give any credit
for words occurring more than once in both strings. Finally, it
treated the strings as simple bags of words and did not account for
word order or phrases.
Marti Hearst and her student Preslav Nakov developed four
derivatives of the classic Dice measure for the task. They developed
Perl code to calculate them, which my team enhanced to work with the
full set of GeneRIFs. This code is available on the active
participants portion of the TREC Web site. The code calculates four
measures (a sketch of the second and third appears after this list):
- Classic Dice with stop words and stemming - The basic
measure is the classic Dice formula using a common stop word list and
the Porter stemming algorithm.
- Modified Unigram Dice - The next measure gives added weight
to terms that occur multiple times in both strings. In particular,
the words of each string are treated as a multiset, and the count
for each shared word is the minimum of its counts in the two
strings.
- Bigram Dice - This measure gives some additional weight to proper
word order. Instead of measuring the Dice coefficient on single
words, it measures it on bigrams.
- Bigram Phrases - Bigrams do not always represent legitimate
phrases. Stop words such as articles and prepositions sometimes
occur between content words, so that straight bigrams of content
words do not represent legitimate phrases. A further measure
therefore includes only bigrams whose words were adjacent in the
original string, i.e., not bigrams created by deleting intervening
stop words.
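As promised above, here is a rough Python sketch of the Modified
Unigram Dice and Bigram Dice measures under the definitions just
given; the stop word removal and stemming applied by the distributed
Perl code are omitted here, and that code remains authoritative:

from collections import Counter

# Modified Unigram Dice: words form multisets; a shared word counts
# the minimum number of times it occurs in the two strings.
def dice_multiset(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    z = sum(min(n, cb[w]) for w, n in ca.items())
    return 2 * z / (sum(ca.values()) + sum(cb.values()))

# Bigram Dice: the classic measure applied to adjacent word pairs.
def bigrams(s):
    words = s.lower().split()
    return set(zip(words, words[1:]))

def dice_bigram(a, b):
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0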
The Perl code is in a WinZip file that contains six files:
- calcDice.pl - The main Perl program, which is invoked from the
command line with the following syntax:
- calcDice.pl SecondaryTaskTextFile CandidateTextFile [-v]
- where SecondaryTaskTextFile is the properly formatted file of
actual GeneRIFs, CandidateTextFile is your file of GeneRIF candidates,
and -v is an optional switch to make the output more verbose.
- calc_dice.pm - Perl module with code for actual Dice coefficient
and derivative measures calculations.
- porter.pm - Perl module for Porter stemming algorithm
- stoplist.txt - list of stop words for program
- sec.txt - properly formatted file of actual GeneRIFs
- cand.txt - sample file of candidate GeneRIFs, which are actually
the titles of articles (It is interesting to see how much - or little!
- these overlap with the GeneRIFs.)
The code outputs a file called report.txt, which lists the four
measures for each GeneRIF and an aggregate (mean) calculation of each.
WARNING: The code does relatively little error checking.
The file format for the actual and candidate GeneRIF files is as
follows:
1 the death effector domain of FADD is involved in interaction with Fas.
2 In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis
where:
- The first column is the TREC ID
- The second column is the candidate GeneRIF text
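As a closing illustration, a title-based candidate file like the
sample cand.txt could be generated roughly as follows; this Python
sketch assumes the five-column, tab-separated secondary task file, a
tab between the two candidate-file columns, and reuses the
parse_medline() loader sketched in the Documents section (the output
file name is hypothetical):

# Map PMIDs to article titles from the MEDLINE collection.
titles = {rec["PMID"]: rec.get("TI", "")
          for rec in parse_medline("trec-medline.gz")}

# Emit "TREC-ID <tab> title" as a baseline candidate for each GeneRIF.
with open("mycand.txt", "w") as out:
    for line in open("sec.txt"):
        trec_id, _locus, pmid = line.rstrip("\n").split("\t")[:3]
        out.write(f"{trec_id}\t{titles.get(pmid, '')}\n")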