TREC 2005 Genomics Track Protocol
William Hersh, Track Chair
Last updated - November 28, 2005
This page contains the protocol for the TREC 2005 Genomics Track.
As
with the 2003 and 2004 tracks, there were two tasks.
Similar to
2004, one of the tasks consisted of
ad hoc retrieval, while the
second involved text categorization. There is a companion data page for the 2005 track that includes
instructions for accessing the data.
Official Runs
For the ad hoc task, 58 official runs were submitted by 32
participating research groups. For the categorization tasks, 192
runs were submitted by 19 groups. The results are reported in the
track overview paper.
Ad Hoc Retrieval Task
In the 2005 ad hoc retrieval task, we employed topics that were
more
structured than the mostly free-form topics from the 2004 track.
The purpose of this approach was to provide systems with better defined
(yet still realistic) queries for finding genomics information.
As such, we developed topics from generic structured templates so
systems could make better use of other resources, such as ontologies or
databases. We also hoped this could serve as the basis to begin
investigation toward an interactive task.
Topics
As with 2004, we collected topics from real biologists.
However, instead of soliciting free-form topics, we provided the
biologists we interviewed with generic templates and asked them to
express
information needs they had recently that fit within those
templates.
While it would have been ideal to interview users and develop the
templates
themselves from such interviews, the time frame of the track did not
allow this. Instead, we developed a set of generic topic
templates (GTTs) derived from an analysis of the topics from the 2004
track and other known biologist information needs. After we
developed the GTTs, 11 people went out and interviewed 25 biologists to
obtain instances of
them. We then had other people do some searching on the topics to
make sure there was at least some, although not too much, information
out
there about them. The topics
did not have to fit precisely into the GTTs, but had to come
close (i.e., have all the required semantic types).
As with 2004, there were 50 topics. We reached closure on 5 GTTs,
each of which had 10
instances, for a total of 50 topics. The five GTTs are listed
below. (We had an
extra one in case one did not pan out during the interviews, which
turned out to be the case.) The semantic types in each GTT are
underlined. For some semantic types, more than one
instance is allowed. The five GTTs are:
- Find articles describing standard methods or protocols for
doing some sort of experiment or procedure.
- Find articles describing the role of a gene
involved in a given disease.
- Find articles describing the role of a gene in a specific biological process.
- Find articles describing interactions (e.g.,
promote, suppress, inhibit, etc.) between two or more genes in the function of an organ
or in a disease.
- Find articles describing one or more mutations of a given gene
and its biological impact.
In order to get participating groups started with the topics, and in
order for them not to "spoil" the automatic status of their official
runs by working with the official topics, we developed 10 sample
topics, with two coming from each GTT. Both the sample and
official topics were available in three formats: a Word file
containing the GTTs and their instances (the topics) in tabular form, a
PDF file that is a "printout" of the Word file, and a text file that
has the topics expressed in a narrative form (essentially the GTTs
filled in with the instances). The sample topics are available on the
track Web site.
All of the sample topics had sample searches and associated relevance
judgments for the retrieved articles. The files with these
searches and judgments were posted in the active user portion of the
track Web site. The files were named 9X.txt, where 9X is the
topic number. In each file is the PubMed
search statement that generated the output, which was filtered for the
time period of the MEDLINE subset (1994-2003).
In the
even-numbered files (90, 92, 94, 96, 98), there is a tag DR, PR, or NR
(definitely, possibly, or not relevant, respectively) before each
MEDLINE record that represents the relevance
judgment. The tag is right before the number of the document in
the PubMed output, e.g., DR1 represents that the first record in the
PubMed output for this particular search was definitely relevant.
In the odd-numbered files (91, 93, 95, 97, 99), the tag DR, PR, or NR
that represents the relevance
judgment is after
each MEDLINE record.
Please note that neither the searches nor the judgments
should be considered complete. In addition, some of the retrieved
MEDLINE records may not be in the MEDLINE subset, while some records
that are in the test collection and should have been retrieved may not
appear in the search output. These searches and judgments are simply provided "as is"
for those who want to see some for these topics.
The official topics are available on the active user portion of the track
Web site. A password, which can ONLY be obtained from Lori Buckland of
NIST (please do not email me to ask for it!), is required to access the
data. The names of the files
are:
- adhoc2005narrative.txt - Narrative version of official topics
(100-149)
- adhoc2005topics.doc - Tabular (Microsoft Word) version of
official topics (100-149)
- adhoc2005topics.pdf - Tabular (PDF) version of official topics
(100-149)
Documents
The document collection for the 2005 ad hoc retrieval task was the
same 10-year MEDLINE subset used for the 2004 track. One goal we
had was
to produce a number of topic and relevance judgment collections that
use this same document collection to make retrieval experimentation
easier (so people do not have to load different collections into their
systems). More uses of this subset will be forthcoming later.
More detail about the document collection is available on the 2004
protocol page and 2004 data page,
although highly pertinent information is reproduced
here. The document collection for the task is a 10-year subset of
the MEDLINE
bibliographic database of the biomedical literature. MEDLINE can
be searched by anyone in the world using the PubMed system of the National Library of Medicine (NLM),
which maintains both MEDLINE and PubMed. The full MEDLINE
database contains over 13 million references dating back to 1966 and is
updated on a daily basis.
The subset of MEDLINE for the TREC 2005 Genomics Track
consists of 10 years of completed citations from the database
inclusive
from 1994 to 2003. Records were extracted using the Date
Completed (DCOM) field for all references in the range of 19940101 -
20031231. This provided a total of 4,591,008
records, which is about one third of the full MEDLINE database. A
gzipped list of the PubMed IDs (PMIDs) is available (10.6 megabytes
compressed).
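For groups that want to check whether a given citation falls inside the
subset, the PMID list can be loaded into a set. This is only an informal
sketch; the file name is a placeholder, and one PMID per line is assumed:

# Sketch: load the gzipped PMID list into a set for subset-membership checks.
# The file name is a placeholder; one PMID per line is assumed.
import gzip

def load_pmid_set(path="medline_subset_pmids.txt.gz"):
    with gzip.open(path, "rt") as fh:
        return {line.strip() for line in fh if line.strip()}

# pmids = load_pmid_set()
# "12345678" in pmids  ->  True if that record is in the 10-year subset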
The data includes all of the PubMed fields identified in the MEDLINE
Baseline record and only the PubMed-centric tags are removed from the
XML version. A description of the various fields of MEDLINE is
available at:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#MEDLINEDisplayFormat
It should also be noted that not all MEDLINE records have
abstracts, usually because the article itself does not have an
abstract. In general, about 75% of MEDLINE records have
abstracts. In our subset, there are 1,209,243 (26.3%) records
without abstracts.
The MEDLINE subset is available in the "MEDLINE" format, which consists
of ASCII text with fields indicated and delimited by 2-4
character abbreviations. The size of the file uncompressed is
9,587,370,116 bytes, while the gzipped version is
2,797,589,659 bytes. The file can be found with the 2004 files on
the data portion of the Web site. An XML version of MEDLINE
subset is also available; see the 2004 data
page for details.
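For groups writing their own loaders for the ASCII version, the format is
line oriented: each field starts with its 2-4 character abbreviation
followed by a dash, and continuation lines are indented. The following is
a rough sketch rather than an official tool; the assumptions noted in the
comments (blank-line record separators, 4-character tag column) should be
verified against the actual file:

# Rough sketch of a reader for the ASCII "MEDLINE" format described above.
# Assumptions to verify against the actual file: records are separated by
# blank lines, the field abbreviation occupies the first 4 characters
# followed by "- ", and continuation lines begin with whitespace.
def parse_medline(path):
    record, tag = {}, None
    for line in open(path, encoding="ascii", errors="replace"):
        line = line.rstrip("\n")
        if not line.strip():                 # blank line ends the current record
            if record:
                yield record
            record, tag = {}, None
        elif line[:1].isspace() and tag:     # continuation of the previous field
            record[tag][-1] += " " + line.strip()
        else:
            tag = line[:4].strip()           # e.g. PMID, TI, AB, DCOM
            record.setdefault(tag, []).append(line[6:].strip())
    if record:                               # final record if no trailing blank line
        yield record

# Example: count records without an abstract (the AB field).
# missing = sum(1 for rec in parse_medline("medline_subset.txt") if "AB" not in rec)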
Groups contemplating using their own MEDLINE collection or filtering
from "live" PubMed should be aware of the following caveats (from Jim
Mork of NLM):
- Some citations may differ due to revisions or corrections.
- Some citations may no longer exist. The bulk of the
citations come from the 2004 Baseline; some may have been removed since
that baseline was created.
- UTF-8 characters have been translated to 7-bit ASCII whereas the
live PubMed system provides UTF-8 data and in some cases puts in
characters like the inverted question mark.
- The XML and ASCII files have been modified to conform to the
MEDLINE Baseline format. This is a subtle difference but one that
may cause changes in the files.
Relevance Judgments
Relevance judgments were done similarly to the TREC 2004 Genomics
Track and other TREC tracks using the "pooling" method, where the
top-ranking documents from each group's best run for each topic were
pooled and given to a judge with expertise in biology. The relevance judges were
instructed in the following manner for each GTT:
- Relevant article must describe how to conduct, adjust, or improve
a standard, a new method, or a protocol for doing some sort of
experiment or procedure.
- Relevant article must describe some specific role of the gene
in the stated disease.
- Relevant article must describe some specific role of the gene
in the stated biological process.
- Relevant article must describe a specific interaction (e.g.,
promote, suppress, inhibit, etc.) between two or more genes in the
stated function of the organ
or the disease.
- Relevant article must describe a mutation of the stated gene
and the particular biological impact(s) that the mutation has been
found to have.
In general, the articles had to describe a specific gene, disease,
impact, mutation, etc. and not the concept generally.
Submitted Runs
We
collected other data about submitted runs besides the system
output. One item was the run
type, which fell into one of (at least) three
categories:
- Automatic - no manual intervention in building queries
- Manual - manual construction of queries but no further human
interaction
- Interactive - completely interactive construction of queries and
further interaction with system output
Recall and precision for the ad hoc retrieval task were calculated
in
the classic IR way, using the preferred TREC statistic of mean average
precision (average precision at each point a relevant document is
retrieved, also called MAP). This was done using the
trec_eval program. The code for
trec_eval is available at http://trec.nist.gov/trec_eval/trec_eval.7.3.tar.gz.
The trec_eval program requires two files for input. One file is
the topic-document output, sorted by each topic and then subsorted by
the order of the IR system output for a given topic. This
format is required for official runs submitted to NIST to obtain
official scoring.
The topic-document output should be formatted as follows:
100 Q0 12474524 1 5567 tag1
100 Q0 12513833 2 5543 tag1
100 Q0 12517948 3 5000 tag1
101 Q0 12531694 4 2743 tag1
101 Q0 12545156 5 1456 tag1
102 Q0 12101238 1 3.0 tag1
102 Q0 12527917 2 2.7 tag1
103 Q0 11731410 1 .004 tag1
103 Q0 11861293 2 .0003 tag1
103 Q0 11861295 3 .0000001 tag1
where:
- The first column is the topic number (100-149) for the 2005
topics.
- The second column is the query number within that topic.
This is currently unused and must always be Q0.
- The third column is the official PubMedID of the retrieved
document.
- The fourth column is the rank at which the document was retrieved.
- The fifth column shows the score (integer or floating point) that
generated the ranking. This score MUST be in descending
(non-increasing) order. The trec_eval program ranks documents
based on the scores, not the ranks in column 4. If a submitter
wants the exact ranking submitted to be evaluated, then the SCORES must
reflect that ranking.
- The sixth column is called the "run tag" and must be a unique
identifier across all runs submitted to TREC. Thus, each run tag
should have a part that identifies the group and a part that
distinguishes runs from that group. Tags are restricted to 12 or
fewer letters and numbers, and *NO* punctuation, to facilitate labeling
graphs and such with the tags.
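As an informal illustration of the format (not an official tool), a run
file could be written from per-topic ranked results as follows; the
results structure, file name, and run tag here are hypothetical examples:

# Sketch: write ranked retrieval output in the six-column run format above.
# "results" maps a topic number (100-149) to a list of (pmid, score) pairs;
# the structure, file name, and tag are hypothetical examples.
def write_run(results, run_tag, path):
    assert len(run_tag) <= 12 and run_tag.isalnum(), "tag: at most 12 letters/digits"
    with open(path, "w") as out:
        for topic in sorted(results):
            ranked = sorted(results[topic], key=lambda pair: pair[1], reverse=True)
            for rank, (pmid, score) in enumerate(ranked, start=1):
                # topic  Q0  PMID  rank  score  run_tag
                out.write(f"{topic} Q0 {pmid} {rank} {score} {run_tag}\n")

# write_run({100: [(12474524, 5567), (12513833, 5543)]}, "tag1", "myrun.txt")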
The second file required for trec_eval is the relevance judgments,
which are called "qrels" in TREC jargon. More information about
qrels can be found at http://trec.nist.gov/data/qrels_eng/. The qrels
file is in the following format:
100 0 12474524 1
101 0 12513833 1
101 0 12517948 1
101 0 12531694 1
101 0 12545156 1
102 0 12101238 1
102 0 12527917 1
103 0 11731410 1
103 0 11861293 1
103 0 11861295 1
103 0 12080468 1
103 0 12091359 1
103 0 12127395 1
103 0 12203785 1
where:
- The first column is the topic number (100-149) for the 2005
topics.
- The second column is always 0.
- The third column is the PubMedID of the document.
- The fourth column is always 1.
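For a rough sanity check of a run before submission, mean average
precision can be approximated directly from these two file formats. This
is only a sketch of the measure as described above (ties are broken
arbitrarily here, unlike in trec_eval); official scores come from trec_eval:

# Sketch: approximate MAP from a run file and a qrels file in the formats above.
# Official scoring is done with trec_eval; this is only a sanity check.
from collections import defaultdict

def mean_average_precision(run_path, qrels_path):
    relevant = defaultdict(set)                    # topic -> set of relevant PMIDs
    for line in open(qrels_path):
        topic, _, pmid, rel = line.split()
        if rel != "0":                             # column 4 is 1 for relevant documents
            relevant[topic].add(pmid)

    runs = defaultdict(list)                       # topic -> [(score, pmid)]
    for line in open(run_path):
        topic, _, pmid, _, score, _ = line.split()
        runs[topic].append((float(score), pmid))

    ap_values = []
    for topic, rel_docs in relevant.items():
        ranked = sorted(runs.get(topic, []), reverse=True)   # rank by descending score
        hits, precision_sum = 0, 0.0
        for i, (_, pmid) in enumerate(ranked, start=1):
            if pmid in rel_docs:
                hits += 1
                precision_sum += hits / i          # precision at each relevant document
        ap_values.append(precision_sum / len(rel_docs))
    return sum(ap_values) / len(ap_values) if ap_values else 0.0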
Categorization Task
The second task for the 2005 track was a categorization task. It
was similar in part to the 2004 categorization task in that it used
data
from the
Mouse Genome Informatics (MGI) system and was a document triage
task. It included another
running of one subtask from 2004, the triage of articles for GO
annotation, and added triage of articles for three other
major types of information collected and
catalogued by MGI.
These included articles about:
- Tumor biology
- Embryologic gene expression
- Alleles of mutant phenotypes
As such, the categorization task looked at how well systems could
categorize documents for four categories (the three listed above plus GO
annotation). We used the same utility
measure used last year but with different parameters (see below).
We created an updated version of the cat_eval program that
calculates
the utility measure plus recall, precision, and the F score. We
calculated utility for each of the four categorization tasks separately.
For more information about the MGI system and the components from which
documents were triaged, consult the following references (not
all of these are freely available on the Web):
- Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, and the
members of the Mouse Genome Database Group. The Mouse Genome Database
(MGD): from genes to mice--a community resource for mouse biology.
Nucleic Acids Res 2005; 33: D471-D475.
- Strivens M, Eppig JT. Visualizing the laboratory mouse: capturing
phenotype information. Genetica 2004; 122: 89-97.
- Hill DP, Begley DA, Finger JH, Hayamizu TF, McCright IJ, Smith CM,
Beal JS, Corbani LE, Blake JA, Eppig JT, Kadin JA, Richardson JE,
Ringwald M. The Mouse Gene Expression Database (GXD): updates and
enhancements. Nucleic Acids Res 2004; 32: D568-D571.
- Näf D, Krupke DM, Sundberg JP, Eppig JT, Bult CJ. The Mouse Tumor
Biology database: a public resource for cancer genetics and pathology
of the mouse. Cancer Res 2002; 62(5): 1235-1240.
Documents
The documents for the 2005 categorization task consisted of the
same full-text articles used in 2004. The articles came from
three journals
over two years, reflecting the full-text data we were able to
obtain from Highwire Press: Journal
of Biological Chemistry (JBC), Journal
of Cell Biology (JCB), and Proceedings
of the National Academy of Sciences (PNAS). These
journals have a good proportion of mouse genome articles. Each
of the papers from these journals is in SGML format.
Highwire's DTD
and its documentation are available. As in 2004, we designated 2002
articles as training data and 2003 articles as test
data. The documents for the tasks came from a subset of these
articles that had the words "mouse" or "mice" or "mus" as described in
the 2004 protocol. A
crosswalk or look-up table was provided that matches an identifier for
each Highwire article (its file name) and its corresponding PubMed ID
(PMID). The table below shows the total number of articles and
the number in the subset the track used.
Journal      | 2002 papers (total, subset) | 2003 papers (total, subset) | Total papers (total, subset)
JBC          | 6566, 4199                  | 6593, 4282                  | 13159, 8481
JCB          | 530, 256                    | 715, 359                    | 1245, 615
PNAS         | 3041, 1382                  | 2888, 1402                  | 5929, 2784
Total papers | 10137, 5837                 | 10196, 6043                 | 20333, 11880
The following table lists the files containing the documents and
related data. The files can be found on the active user portion
of the Web site.
File contents                                                                                                        | Training data file name | Test data file name
Full-text document collection in Highwire SGML format (in 2004 data directory)                                      | train.tar.Z             | test.tar.Z
Crosswalk files of PMID, Highwire file name of article, journal code, and year of publication (in 2005 data directory) | train.crosswalk.txt     | test.crosswalk.txt
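A minimal sketch of loading a crosswalk file into a lookup table follows.
The field order is taken from the description above (PMID, Highwire file
name, journal code, year); the tab delimiter is an assumption and should
be checked against the actual file:

# Sketch: load a crosswalk file into a PMID -> (Highwire file, journal, year) map.
# Assumes one record per line with the four fields listed above, tab-separated;
# verify the delimiter and field order against the actual file.
def load_crosswalk(path):
    crosswalk = {}
    for line in open(path):
        if line.strip():
            pmid, highwire_file, journal, year = line.rstrip("\n").split("\t")
            crosswalk[pmid] = (highwire_file, journal, year)
    return crosswalk

# crosswalk = load_crosswalk("train.crosswalk.txt")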
The SGML training document collection is 150 megabytes in size
compressed and 449 megabytes uncompressed. The SGML test document
collection is 140 megabytes compressed and 397 megabytes
uncompressed. Many gene names have Greek or other
non-English characters, which can present a problem for those
attempting
to recognize gene names in the text. The Highwire SGML appears to
obey the rules posted on the NLM Web site with regards to these
characters (http://www.ncbi.nlm.nih.gov/entrez/query/static/entities.html).
Evaluation measures
We used the utility measure as the primary evaluation measure, but in a
slightly different way in 2005. This was because there are varying
numbers of positive examples for the four different categorization tasks.
The framework for evaluation in the categorization task was based on
the
following table of possibilities:
              | Relevant (classified) | Not relevant (not classified) | Total
Retrieved     | True positive (TP)    | False positive (FP)           | All retrieved (AR)
Not retrieved | False negative (FN)   | True negative (TN)            | All not retrieved (ANR)
Total         | All positive (AP)     | All negative (AN)             |
The measure for evaluation was the utility measure often applied
in
text categorization research and used by the former TREC Filtering
Track. This measure contains coefficients for the
utility of retrieving a relevant and retrieving a nonrelevant document.
We used a version that was normalized by the best possible
score:
Unorm = Uraw / Umax
For a test collection of documents to categorize, Uraw is
calculated as follows:
Uraw = (ur * TP) + (unr * FP)
where:
- ur = relative utility of relevant document
- unr = relative utility of nonrelevant document
For our purposes, we assume that unr = -1 and solve for ur
using MGI's current practice of triaging everything (i.e., if every
document is retrieved, then TP = AP and FP = AN, and we set the
resulting utility to zero):
0.0 = ur*AP - AN
ur = AN/AP
AP and AN were different for each task, as shown in the following table:
TASK              | TRAIN/TEST | AP  | AN   | N    | ur
A (allele)        | TRAIN      | 338 | 5499 | 5837 | 16.27
A (allele)        | TEST       | 332 | 5711 | 6043 | 17.20
E (expression)    | TRAIN      | 81  | 5756 | 5837 | 71.06
E (expression)    | TEST       | 105 | 5938 | 6043 | 56.55
G (GO annotation) | TRAIN      | 462 | 5375 | 5837 | 11.63
G (GO annotation) | TEST       | 518 | 5525 | 6043 | 10.67
T (tumor)         | TRAIN      | 36  | 5801 | 5837 | 161.14
T (tumor)         | TEST       | 20  | 6023 | 6043 | 301.15
(Yes, the numbers for GO annotation were different from the 2004
data. This is because additional articles were triaged by
MGI since we collected the data last year.)
The ur's for A and G are fairly close across the training
and test collections, while the ur's for E and especially T vary
much more. We therefore established a ur that was
the average of that computed for the training and test collection,
rounded to the nearest whole number. That resulted in this set of ur's
for each task:
TASK              | ur
A (allele)        | 17
E (expression)    | 64
G (GO annotation) | 11
T (tumor)         | 231
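To make the calculation concrete, here is a small sketch of the
normalized utility as defined above. It assumes Umax is the utility of a
perfect run (every positive document retrieved and no negatives, i.e.,
ur * AP), which is consistent with the sample cat_eval output shown
later; cat_eval is the official implementation:

# Sketch of the normalized utility defined above, with unr = -1.
# Assumes Umax = ur * AP, i.e. the score of a run retrieving every positive
# document and no negative ones; cat_eval is the official implementation.
def normalized_utility(tp, fp, ap, ur):
    u_raw = ur * tp - fp
    u_max = ur * ap
    return u_raw / u_max

# Expression sub-task, training data (AP = 81, ur = 64), run with tp=81, fp=2538:
# normalized_utility(81, 2538, 81, 64)  ->  0.5104 (matches the sample output below)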
In order to facilitate calculation of the modified version of the
utility measure for the 2005 track, we updated the cat_eval
program to version 2.0, which included a command-line parameter to set ur.
Here is documentation for the program:
The program has been tested on Windows and Solaris but should run on
just about any OS where you can install Python. Similar to
trec_eval,
there is very little error-checking for proper data format, so make
sure your data files are in the format specified on the track protocol
page. If you have any issues, please email Aaron Cohen at cohenaa@ohsu.edu.
Installation:
For cat_eval2.py, you need Python 2.3 or above installed. Install
as you would any script.
For cat_eval2.zip, unzip it to a directory on a Windows machine and
execute cat_eval2.exe; no Python installation is necessary.
Usage: cat_eval.py data_file gold_standard_file Urelevant [-tab]
The required Urelevant parameter needs to be set to the proper weight
for each of the sub-tasks.
The optional -tab argument makes the script output results in tab
delimited format instead of the console readable format.
There is one sample output file for each sub-task in the active user
portion of the track Web site:
- Allele task: sample.Atrain.txt
- Expression task: sample.Etrain.txt
- GO annotation task: sample.Gtrain.txt
- Tumor task: sample.Ttrain.txt
The cat_eval program expects a separate file for each run of each task,
where the file has three tab-separated fields:
triageG 12189157 sample
triageG 12189154 sample
triageG 12393878 sample
triageG 12451176 sample
triageG 12209011 sample
triageG 12209014 sample
where:
- The first column is the task, i.e., one of triageA, triageE,
triageG, or triageT.
- The second column is the PMID of a document classified as positive
for this task.
- The third column is the run tag, a short institution and run
identifier that distinguishes the runs of the group. Tags are
restricted to 12 or fewer letters and numbers, and no punctuation, to
facilitate labeling graphs and such with the tags.
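As an informal illustration (the list of positive PMIDs, tag, and file
name here are hypothetical), a run file in this three-column format
could be written like so:

# Sketch: write a categorization run in the three tab-separated columns above.
# "positive_pmids" is a hypothetical list of PMIDs classified as positive.
def write_triage_run(task, positive_pmids, run_tag, path):
    assert task in ("triageA", "triageE", "triageG", "triageT")
    assert len(run_tag) <= 12 and run_tag.isalnum()
    with open(path, "w") as out:
        for pmid in positive_pmids:
            out.write(f"{task}\t{pmid}\t{run_tag}\n")

# write_triage_run("triageG", [12189157, 12189154], "sample", "myrun.G.txt")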
Here is sample output you should get from the sample Expression
sub-task:
cat_eval.py sample.Etrain.txt triage.Etrain.txt 64.0
Run: sample
Counts: tp=81; fp=2538; fn=0
Precision: 0.0309
Recall: 1.0000
F-score: 0.0600
Utility Factor: 64.00
Raw Utility: 2646
Max Utility: 5184
Normalized Utility: 0.5104
The official results will calculate Unorm for each of the
four
categories as well as an overall mean of the four.
Participants were strongly
encouraged
to adhere to a naming convention for their run tags that has the first
letter of the tag designating the specific run: a for allele, e
for
expression, g for GO, and t for tumor.
Data
The training data came in four files, one for each category (i.e., A,
E, G, and T). (The fact that three of these four letters correspond to
nucleotides in DNA is purely coincidental!) They were
named as Atrain.txt, Etrain.txt, etc. and are available in the active
user portion of the track Web site. The test data obey the same
naming conventions.
What Resources Can Be Used?
A common question was, what resources can be legitimately used to aid
in
categorizing the documents? In general, we allowed use of
anything,
including resources on the MGI Web site. The only resource participants could not
use was the
direct data itself, i.e., data that is directly linked to the PMID or
the associated MGI unique identifier. Thus, they could not go
into
the MGI database (or any other aggregated resource such as Entrez Gene
or SOURCE) and pull out GO codes, tumor terms, mutant phenotypes, or
any other data that was explicitly linked to a document.
But anything else was fair game.
We also made available a cheatsheet
developed by MGI for its curators who triage documents. This
version of the sheet was a couple years old, but given that the
articles
and
data we are using were also that old, this sheet may actually have been
more
appropriate than an up-to-date one would have been. As with the
sample
searches in the ad hoc task, this was provided on an "as is" basis,
with
no guarantees it would be helpful. The appropriate "Areas" on the
sheet that correlate to the categories we were triaging to are:
- Alleles and phenotypes - A
- Expression - E
- Gene Ontology - G
- Tumor - T
Automated tagging of mouse genes in MEDLINE corpus
Aaron Cohen of OHSU processed the entire TREC 2004 10-year MEDLINE
subset with the mouse named-entity recognizer and normalizer (NER+N)
that we presented at BioLINK 2005. There are two files in the TREC 2004
genomics data directory that correspond to the two halves of the split
10 year subset:
- 2004_TREC_MEDLINE_1_MGI_FOR_PMID.tsv.txt corresponds to
2004_TREC_MEDLINE_1.gz
- 2004_TREC_MEDLINE_2_MGI_FOR_PMID.tsv.txt corresponds to
2004_TREC_MEDLINE_2.gz
(Note that these files are from the MEDLINE corpus for the ad hoc task,
but all of the full-text Highwire documents are in this corpus and can
be linked using the crosswalk files.)
The files are in tab-separated format, one line for each PMID. The
first field is the PMID of the MEDLINE record, followed by a variable
number of fields which are the MGI identifiers for the genes found by
our system in that record. Only MEDLINE records found to contain one or
more mouse genes are included. The files are about 18 megabytes each.
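A minimal sketch of loading these files into a PMID-to-genes lookup
(file names as listed above; tab-separated, first field the PMID,
remaining fields MGI identifiers):

# Sketch: load the tab-separated gene annotations into a dict mapping
# PMID -> list of MGI gene identifiers found in that MEDLINE record.
def load_gene_tags(path):
    genes_by_pmid = {}
    for line in open(path):
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            genes_by_pmid[fields[0]] = fields[1:]
    return genes_by_pmid

# tags = load_gene_tags("2004_TREC_MEDLINE_1_MGI_FOR_PMID.tsv.txt")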
This could be a useful addition to the corpus. However, the files were
automatically generated and have not been manually reviewed, so some
errors are to be expected. In Aaron's BioLink paper (available at http://acl.ldc.upenn.edu/W/W05/W05-1303.pdf),
he reported a precision of 0.775 at a recall of 0.726 for mouse gene
NER+N on the BioCreative test collection, which is competitive with the
state of the art.