TREC 2003 Genomics Track - Protocol
William Hersh, Track Chair
Oregon Health & Science University
hersh@ohsu.edu
Last updated - November 30, 2008
This document is the final official protocol for the TREC 2003
Genomics Track. The track consisted of a primary and a secondary
task. The primary task was a conventional ad hoc retrieval task,
with success measured by recall-precision. The secondary task
explored information extraction from article abstracts. (NOTE: The
LocusLink database used in this track has since been superseded by
EntrezGene.)
Some important URLs
Text Retrieval Conference (TREC) - http://trec.nist.gov/
TREC Genomics Track - http://ir.ohsu.edu/genomics/
Getting started
The following underlined text shows the names of the links that
provide access to various parts of the data from the protected area
of this Web site:
- Genomics Track Homepage links
to this site.
- Primary Task Documents links to a page that provides
instructions on what must be done to obtain the document
collection. You must complete the data use form and send it to
NIST,
after which you will receive a password to access the document
collection from this page.
- Primary Task - Training Relevance Judgments (qrels) links
to the relevance judgments for the training data. This file
consists of GeneRIFs from LocusLink for the 50 training topics.
- Primary Task - Training Topics links to the training
topics. This file consists of gene names from LocusLink.
- Primary Task - Test Topics links to the official test
topics. This file also consists of gene names from LocusLink.
- Primary Task - Test Relevance Judgments (qrels) links to
the relevance judgments for the test data. This file consists of
GeneRIFs from LocusLink for the 50 test topics.
- Secondary Task Data links to the data for the secondary
task.
- Secondary Task Optional Full-Text Documents links to the
full-text of the 139 articles referenced by the GeneRIFs in the
secondary task. (The use of these is optional but they are there
if you want them.)
- Secondary Task Perl code is a WinZip file containing the
code and sample files for calculating results for the Dice
coefficient and its derivative measures.
General principles
A number of general principles have emerged concerning the 2003 track
from the various workshops and other discussions held so far:
- We want to use document and topic sets large enough in scope to
be meaningful from an IR standpoint.
- We want the first year to be constrained enough
so as not to require a great deal of resources on the part of
NIST or the track.
- Our initial task will focus on document/text retrieval.
- Our model system user is a biology researcher: either an
established researcher seeking information in a new area or a
graduate student wanting to find information about a topic.
- We will try to develop tasks/topics that reflect the real
information needs of biology researchers.
- If possible, we will try to make use of existing resources for
relevance judgments in the first year, given our constraints.
- Our first year’s work will also include identifying or applying
for resources that will enable us to develop the experimental milieu
that we need to carry out meaningful research.
Most of our decisions for the 2003 track emanate from the choice of
resource we plan to use for relevance judgments, which is the Gene
References into Function (GeneRIF) field of
LocusLink.
GeneRIFs are relatively new, but are thought by consensus to be
the most consistent and comprehensive resource for relevance judgments
available short of developing our own. Their major limitation is
that they have only been systematically applied by NLM indexers since
around 4/1/2002. There are other possible options, such as the
literature annotations in some of the other model organisms, in
particular mouse and yeast, but we have decided to go with GeneRIFs for
now.
Primary task
As noted above, the primary task will consist
of ad hoc document retrieval. This type of task requires
a document collection, topics, and relevance judgments. For
those wanting an overview of the general approach of TREC, we
suggest the overview from the TREC 2002 conference.
Documents
The document collection consists of 525,938 MEDLINE records whose
indexing was completed between 4/1/2002 and 4/1/2003. The
data are available on the TREC Web site, which is password-protected
and limited to only those who have signed up
to participate in TREC. The MEDLINE records are in a single
compressed
text file. (A 10,000-record subset is also available to those
wanting to get started with a smaller collection of data.) The
following files will be available:
- trec-medline.gz - full collection of 525,938 MEDLINE records for
track
- trec-medline10k.gz - 10,000-record subset of MEDLINE
The data format is the standard NLM MEDLINE format, not XML. In
the NLM format, the delimiter between records is a blank line, and
fields are indicated by their 2-3 letter abbreviation. The fields
likely to be most important to
track participants are: PubMed Unique Identifier (PMID),
title (TI), abstract (AB), and MeSH headings (MH). A description
of all the fields in a MEDLINE record can be found in the PubMed help
file at:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#MEDLINEDisplayFormat
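To illustrate, here is a minimal Python sketch of one way to read the
NLM-format file, following the field layout described in the PubMed
help file cited above (four-character field tags followed by "- ",
continuation lines beginning with whitespace); it is a sketch only,
not an official loader:

import gzip

def parse_medline(path):
    # Iterate over records: a blank line ends a record, a field tag
    # occupies the first four columns followed by "- ", and wrapped
    # field values continue on lines that begin with whitespace.
    record, tag = {}, None
    with gzip.open(path, "rt") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if not line.strip():                 # blank line: end of record
                if record:
                    yield record
                record, tag = {}, None
            elif line[:1] in (" ", "\t") and tag:
                record[tag] += " " + line.strip()  # continuation line
            else:
                tag, value = line[:4].strip(), line[6:]
                if tag in record:                # repeatable field, e.g. MH
                    record[tag] += "; " + value
                else:
                    record[tag] = value
    if record:                                   # last record, no trailing blank
        yield record

for rec in parse_medline("trec-medline.gz"):
    print(rec.get("PMID"), rec.get("TI", "")[:60])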
Participating groups must load the data into an IR system to perform
the experiments. This will likely involve reformatting the data to
meet the input format of the system they choose to use. Although
MEDLINE data are publicly available on the Internet, participants
must remember that MEDLINE records from this collection must not be
posted or otherwise made available on a public Web site, since
this subset is an incomplete version of the MEDLINE database. One
system commonly used for IR experimentation that you may want to
consider is SMART, which is available at
ftp://ftp.cs.cornell.edu/pub/smart/.
Topics
The topics consist of gene names, with the specific task being as
follows:
- For gene X, find all MEDLINE references that focus on the basic
biology of the gene or its protein products from the designated
organism. Basic biology includes isolation, structure, genetics
and function of genes/proteins in normal and disease states.
We are distributing training and test topic sets of 50 genes
each. The training data are provided for groups to get an idea of
what the data in this year's track are like. The test data are
the topics for the official runs in the track. For each set of 50
topics, we have randomly chosen gene names that are distributed across
the spectrum of organisms, the number of GeneRIFs (many to few), name
types (see below), and whether or not the gene names are Medical
Subject Heading (MeSH) indexing terms. The training data include
a file of GeneRIFs that comprise the relevance judgments for these
topics. (See below for more details.)
Gene names
The file of gene names has the following format:
3 2120 Homo sapiens OFFICIAL_GENE_NAME ets variant gene 6 (TEL oncogene)
3 2120 Homo sapiens OFFICIAL_SYMBOL ETV6
3 2120 Homo sapiens ALIAS_SYMBOL TEL
3 2120 Homo sapiens PREFERRED_PRODUCT ets variant gene 6
3 2120 Homo sapiens PRODUCT ets variant gene 6
3 2120 Homo sapiens ALIAS_PROT TEL1 oncogene
4 2252 Homo sapiens OFFICIAL_GENE_NAME fibroblast growth factor 7 (keratinocyte growth factor)
4 2252 Homo sapiens OFFICIAL_SYMBOL FGF7
4 2252 Homo sapiens ALIAS_SYMBOL KGF
4 2252 Homo sapiens ALIAS_SYMBOL HBGF-7
4 2252 Homo sapiens PREFERRED_PRODUCT fibroblast growth factor 7 precursor
4 2252 Homo sapiens PRODUCT fibroblast growth factor 7 precursor
4 2252 Homo sapiens ALIAS_PROT keratinocyte growth factor
where:
- The first column is the official topic number (1-50).
- The second column contains the LocusLink ID for
the gene.
- The third column contains the name of the organism.
- The fourth column contains the gene name type.
- The fifth column contains the gene name.
There are four possible organisms:
- Homo sapiens - human
- Mus musculus - mouse
- Rattus norvegicus - rat
- Drosophila melanogaster - fruit fly
There are five possible name types:
- OFFICIAL_GENE_NAME
- PREFERRED_GENE_NAME
- OFFICIAL_SYMBOL
- PREFERRED_SYMBOL
- PREFERRED_PRODUCT
The rules for naming (from the LocusLink help file) are as follows:
[OFFICIAL|PREFERRED]_GENE_NAME:
[alphanumeric] [unique] [required (but may be null)]
the gene description used for gene reports
OFFICIAL: validated by the appropriate nomenclature committee
PREFERRED: interim selected for display
[NOTES--If the symbol is official, the gene_name will be official.
No record will have both official AND interim nomenclature.]
[OFFICIAL|PREFERRED]_SYMBOL:
[alphanumeric] [unique] [required]
the symbol used for gene reports
OFFICIAL: validated by the appropriate nomenclature committee
PREFERRED: interim option selected for display
na is used for models without evidence
PREFERRED_PRODUCT:
[alphanumeric] [unique] [optional]
the name of the product used in the RefSeq record
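Putting the pieces above together, a Python sketch of reading the
gene-name file into per-topic queries might look as follows (the
tab-separated columns and the file name "topics.txt" are assumptions
for illustration, not part of the official distribution):

from collections import defaultdict

# Collect every name variant for each of the 50 topics.
names_by_topic = defaultdict(list)
with open("topics.txt") as f:                   # hypothetical file name
    for line in f:
        topic, locus_id, organism, name_type, name = \
            line.rstrip("\n").split("\t")
        names_by_topic[int(topic)].append(name)

# One simple query formulation: the union of all name variants,
# e.g. "ets variant gene 6 (TEL oncogene) ETV6 TEL ..." for topic 3.
query = " ".join(names_by_topic[3])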
Relevance Judgments
Due to resource constraints, the relevance judgments for the 2003 track
consist of GeneRIFs from the NLM's LocusLink database. Track
participants are not allowed to use GeneRIF data to augment their
queries. While we recognize that GeneRIFs are, like the rest of
LocusLink, publicly available, we will be working on the honor
system of not using GeneRIF data.
We will calculate recall and precision in the
classic IR way, using the preferred TREC statistic of mean average
precision (MAP): the mean, across topics, of the average of the
precision values at each point a relevant document is
retrieved. This will be done in the standard
TREC way of participants submitting their results in the format for
input to Chris Buckley’s trec_eval program. The code for
trec_eval is available at
ftp://ftp.cs.cornell.edu/pub/smart/ . There are several
versions of trec_eval, which differ mainly in the statistics they
calculate in their output. We recommend the following version of
trec_eval, which should compile with any C compiler:
ftp://ftp.cs.cornell.edu/pub/smart/trec_eval.v3beta.shar.
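As a worked example of the statistic: if a topic has four relevant
documents in the qrels and a run retrieves three of them at ranks 1,
3, and 10, its average precision is (1/1 + 2/3 + 3/10) / 4, or about
0.49, and MAP is the mean of this value over all 50 topics.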
Please note a few "quirks" about trec_eval:
- It uses some Unix-specific system calls and so would require
considerable modification to run on another platform.
- The aggregate statistics that are presented at the end of the
file (averages for precision at points of recall, average precision,
etc.) only include queries for which one or more documents were
retrieved. Therefore, you should insert a "dummy" document in
your output for queries that retrieve no real documents so that your
aggregate
scores are averaged over all 50 documents.
The trec_eval program requires two files for input. One file is
the topic-document output, sorted by each topic and then subsorted by
the order of the IR system output for a given topic. This
format is required for official runs submitted to NIST to obtain
official scoring. The topic-document output should be formatted by
your IR system as follows (a sketch of producing this format in code
appears after the column descriptions below):
1 Q0 12474524 1 5567 tag1
1 Q0 12513833 2 5543 tag1
1 Q0 12517948 3 5000 tag1
1 Q0 12531694 4 2743 tag1
1 Q0 12545156 5 1456 tag1
2 Q0 12101238 1 3.0 tag1
2 Q0 12527917 2 2.7 tag1
3 Q0 11731410 1 .004 tag1
3 Q0 11861293 2 .0003 tag1
3 Q0 11861295 3 .0000001 tag1
where:
- The first column is the topic number (1-50).
- The second column is the query number within that topic. This is
currently unused and must always be Q0.
- The third column is the official PubMedID of the retrieved
document.
- The fourth column is the rank at which the document was retrieved.
- The fifth column shows the score (integer or floating point) that
generated the ranking. This score MUST be in descending
(non-increasing) order. The trec_eval program ranks documents
based on the scores, not the ranks in column 4. If a submitter
wants the exact ranking submitted to be evaluated, then the SCORES must
reflect that ranking.
- The sixth column is called the "run tag" and must be a unique
identifier across all runs submitted to TREC. Thus, each run tag should
have a part that identifies the group and
a part that distinguishes runs from that group. Tags are
restricted to 12 or fewer letters and numbers, and *NO* punctuation, to
facilitate labeling graphs and such with the tags.
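For illustration, the following Python sketch writes a run file in
this format, including the dummy-document workaround noted above; the
"results" mapping, the output file name, and the run tag "myrun1" are
all hypothetical:

# "results" is assumed to map topic numbers to (pmid, score) lists
# already sorted by descending score.
results = {1: [("12474524", 5567), ("12513833", 5543)]}  # toy example

with open("myrun.txt", "w") as out:
    for topic in range(1, 51):
        hits = results.get(topic) or [("0", 0)]  # dummy document so the
                                                 # topic counts in averages
        for rank, (pmid, score) in enumerate(hits, start=1):
            out.write(f"{topic} Q0 {pmid} {rank} {score} myrun1\n")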
The second file required for trec_eval is the relevance judgments,
which are called "qrels" in TREC jargon. More
information about qrels can be found at
http://trec.nist.gov/data/qrels_eng/ . The qrels file is in
the following format:
1 0 12474524 1
1 0 12513833 1
1 0 12517948 1
1 0 12531694 1
1 0 12545156 1
2 0 12101238 1
2 0 12527917 1
3 0 11731410 1
3 0 11861293 1
3 0 11861295 1
3 0 12080468 1
3 0 12091359 1
3 0 12127395 1
3 0 12203785 1
where:
- The first column is the topic number (1-50).
- The second column is always 0.
- The third column is the PubMedID of the document.
- The fourth column is always 1.
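For local sanity checks before submitting, the core MAP computation
can be approximated in a few lines of Python (a sketch only; official
scores come from trec_eval, which ranks by score rather than by file
order, and the file names here are hypothetical):

from collections import defaultdict

relevant = defaultdict(set)
for line in open("qrels.txt"):
    topic, _, pmid, _ = line.split()
    relevant[topic].add(pmid)

ranked = defaultdict(list)
for line in open("myrun.txt"):
    topic, _q0, pmid, _rank, _score, _tag = line.split()
    ranked[topic].append(pmid)

aps = []
for topic, rels in relevant.items():
    found, ap = 0, 0.0
    for i, pmid in enumerate(ranked[topic], start=1):
        if pmid in rels:
            found += 1
            ap += found / i          # precision at each relevant hit
    aps.append(ap / len(rels))       # divide by total relevant in qrels

print("MAP:", sum(aps) / len(aps))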
Training data issues
The group from Oregon & Health Science University performed an
analysis of relevance for 10 queries
from the training data. This preliminary work validates our basic
assumption that an article pointed to by a GeneRIF is likely to be
relevant in the classic IR sense but here are many “false negatives”
(i.e., articles that are relevant but do not have a GeneRIF
designation). We plan to carry out a similar analysis for all of
the queries in the test data after all results have been submitted.
We also discovered some "quirks" with the training data qrels:
- A number of qrels represent documents not present in the document
collection.
- Three topics have no qrels in the document collection: 21,
35, and 49.
We made sure these problems did not occur with the test data. For
the test topics, we also used only gene names that had a minimum of
three qrels in the collection.
Secondary task
There is much interest in the bioinformatics community in information
extraction. This comes in part from the desire to allow
scientists to learn about new topics as quickly as they can, preferably
without having to read and synthesize many papers. So we
will begin explorations of this as well. The particular task
in 2003 will be to reproduce the GeneRIF annotation. This task is
more exploratory in nature and has no "gold standard"; instead,
groups will attempt the task and compare their methods and
results. One possibility is to calculate some sort of overlap
measure between words that research groups nominate for annotation with
those actually selected by NLM. A problem, however, is that while
some GeneRIF snippets are direct quotations from article abstracts,
others are paraphrased. Furthermore, there can be other
legitimate references to basic gene biology beyond the official GeneRIF
snippet. An analysis by Jim Mork and Alan Aronson of NLM found
that 95% of GeneRIF snippets contained some text from the title or
abstract of the article. About 42% of the matches were direct
"cut and paste" from the title or abstract, and another 25% contained
significant runs of words from the title or abstract.
The goal of the secondary task is to reproduce the GeneRIF from
the MEDLINE record. Because of the exploratory nature of this
task,
we have not provided any training data. Groups are asked to use
automated approaches and describe them frankly in their reports.
Likewise, we do not have a single performance measure, but rather
a suite of them.
Secondary task data
The data for the secondary task consist of 139 GeneRIFs representing
all of the articles appearing in five journals (Journal of Biological
Chemistry, Journal of Cell Biology, Nucleic Acids Research,
Proceedings of the National Academy of Sciences, and Science) during
the latter half of 2002. The GeneRIFs are formatted as follows:
1 355 12107169 J Biol Chem 2002 Sep 13;277(37):34343-8. the death effector domain of FADD is involved in interaction with Fas.
2 355 12177303 Nucleic Acids Res 2002 Aug 15;30(16):3609-14. In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis
where:
- The first column is the TREC ID
- The second column is the LocusLink ID for the gene
- The third column is the PubMed ID
- The fourth column is the source of the article
- The fifth column is the GeneRIF text
We also have the full text of these articles from Highwire Press,
which obtained permission from the publishers for us to use
them. These are available on the protected portion of the TREC Web
site.
Secondary Task Performance Measures
The original plan for assessing the secondary task was to use the Dice
coefficient, which measures overlap of two strings. That is, the
overlap between the candidate GeneRIF and actual GeneRIF would be
calculated.
For two strings A and B, define X as the number of words in A, Y
as the number of words in B, and Z as the number of words occurring in
both A and B. The Dice coefficient is measured by:
Dice (A, B) = (2 * Z)/(X + Y)
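In Python, a sketch of this basic measure, treating each string as a
set of lower-cased words (an illustration of the definition above,
not the official scoring code):

# e.g. dice("death effector domain", "effector domain of FADD")
#        = 2 * 2 / (3 + 4) = 4/7, about 0.57
def dice(a, b):
    x, y = set(a.lower().split()), set(b.lower().split())
    return 2 * len(x & y) / (len(x) + len(y))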
It quickly became apparent that this measure was quite limited.
It did not, for example, perform any “normalization” of words, such as
stop word removal or stemming. It also did not give any credit
for words occurring more than once in both strings. Finally, it
treated the strings as simple bags of words and did not account for
word order or phrases.
Marti Hearst and her student Preslav Nakov developed four
derivatives of the classic Dice measure for the task. They developed
Perl code to calculate them, which my team enhanced to work with the
full set of GeneRIFs. This code is available on the active
participants portion of the TREC Web site. The code calculates four
measures (a sketch of the second and third appears after this list):
- Classic Dice with stop words and stemming - The basic
measure is the classic Dice formula using a common stop word list and
the Porter stemming algorithm.
- Modified Unigram Dice - The next measure gives added weight
to terms that occur multiple times in both strings. In particular,
the words of each string are treated as a multiset, and the count
for each shared word is the minimum of its counts in the two
strings.
- Bigram Dice - This measure gives some additional weight to proper
word order. Instead of measuring the Dice coefficient on single
words, it measures it on bigrams.
- Bigram Phrases - Bigrams do not always represent legitimate
phrases. Stop words such as articles and prepositions sometimes
occur between content words, so that straight bigrams of content
words do not represent legitimate phrases. A further measure
therefore includes only bigrams whose words were adjacent in the
original string, i.e., not bigrams created by deleting intervening
stop words.
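As promised above, here is a rough Python sketch of the Modified
Unigram Dice and Bigram Dice measures under the definitions just
given; the stop word removal and stemming applied by the distributed
Perl code are omitted here, and that code remains authoritative:

from collections import Counter

# Modified Unigram Dice: words form multisets; a shared word counts
# the minimum number of times it occurs in the two strings.
def dice_multiset(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    z = sum(min(n, cb[w]) for w, n in ca.items())
    return 2 * z / (sum(ca.values()) + sum(cb.values()))

# Bigram Dice: the classic measure applied to adjacent word pairs.
def bigrams(s):
    words = s.lower().split()
    return set(zip(words, words[1:]))

def dice_bigram(a, b):
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0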
The Perl code is in a WinZip file that contains six files:
- calcDice.pl - The main Perl program, which is invoked from the
command line with the following syntax:
- calcDice.pl SecondaryTaskTextFile CandidateTextFile [-v]
- where SecondaryTaskTextFile is the properly formatted file of
actual GeneRIFs, CandidateTextFile is your file of GeneRIF candidates,
and -v is an optional switch to make the output more verbose.
- calc_dice.pm - Perl module with code for actual Dice coefficient
and derivative measures calculations.
- porter.pm - Perl module for Porter stemming algorithm
- stoplist.txt - list of stop words for program
- sec.txt - properly formatted file of actual GeneRIFs
- cand.txt - sample file of candidate GeneRIFs, which are actually
the titles of articles (It is interesting to see how much - or little!
- these overlap with the GeneRIFs.)
The code outputs a file called report.txt, which lists the four
measures for each GeneRIF and an aggregate (mean) calculation of each.
WARNING: The code does relatively little error checking.
The file format for the actual and candidate GeneRIF files is as
follows:
1 the death effector domain of FADD is involved in interaction with Fas.
2 In the case of Fas-mediated apoptosis, when we transiently introduced these hybrid-ribozyme libraries into Fas-expressing HeLa cells, we were able to isolate surviving clones that were resistant to or exhibited a delay in Fas-mediated apoptosis
where:
- The first column is the TREC ID
- The second column is the candidate GeneRIF text
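As a closing illustration, a title-based candidate file like the
sample cand.txt could be generated roughly as follows; this Python
sketch assumes the five-column, tab-separated secondary task file, a
tab between the two candidate-file columns, and reuses the
parse_medline() loader sketched in the Documents section (the output
file name is hypothetical):

# Map PMIDs to article titles from the MEDLINE collection.
titles = {rec["PMID"]: rec.get("TI", "")
          for rec in parse_medline("trec-medline.gz")}

# Emit "TREC-ID <tab> title" as a baseline candidate for each GeneRIF.
with open("mycand.txt", "w") as out:
    for line in open("sec.txt"):
        trec_id, _locus, pmid = line.rstrip("\n").split("\t")[:3]
        out.write(f"{trec_id}\t{titles.get(pmid, '')}\n")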