TREC 2007 Genomics Track Protocol
Last updated: November 30, 2008
Relevance judgments are posted in the protected area of
the track Web site. An overview
of the results is available.
For the TREC 2007 Genomics Track, we are undertaking a modification of
the question answering extraction task used in 2006.
We will continue to task systems with extracting relevant passages
of text that answer topic questions. However, for this year, instead of
categorizing questions by generic topic type (GTT), we will derive
questions based on biologists’ information needs where the answers, in
part, are lists of named entities of a given type. Systems will
still be required to return a passage of text, which will provide one
or more relevant list items in context.
We have gathered new information needs from working biologists. This
was done by modifying the questionnaire used in 2004 to survey several
biologists about recent information needs. In addition to asking about
information needs, biologists were asked if their desired answer is
a list of a certain type of thing, such as genes, proteins, diseases,
mutations, etc. We collected about 50 information
needs statements, of which about 36 will be used as official topics and
14 as sample topics. A list of entity types is provided below.
Similar to last year, systems will return passages of text.
Relevance judges will assign relevant passages "answers," i.e., items
belonging to a single entity class, analogous to the assignment of MeSH
aspects in 2006. After pooling the top nominated passages as in past
years, relevance judges will select relevant passages. Judges will then
assign one or more answer identifiers to each relevant passage.
To be marked relevant, a passage must contain one or more named
entities of the given type, with supporting text that answers the given
question. Passages will be given credit for each relevant and supported answer.
This is required because it is likely that the expert judges will not
be able to judge the response unless it contains an entity of the type
that they are looking for.
We will use the same full-text document corpus
assembled for last year's Genomics Track. The documents in this
corpus will come from the Highwire
Press electronic distribution of journals and are in HTML format.
There are about 160,000 documents in the corpus from about 49 genomics-related journals.
The evaluation measures for this task are a refinement of the
measures used in 2006. We will continue to use document MAP as is,
i.e., a document that contains a passage judged relevant is deemed
relevant. We will use a character-based MAP measure like last year to
compare the accuracy of the extracted answers, but will modify it to
treat each individually retrieved character in published order as
relevant or not, in a sort of "every character is a mini
relevance-judged document" approach. This will increase the stability
of the passage MAP measure against arbitrary passage splitting
techniques. The aspect measure will remain the same, except that
instead of using assigned MeSH aspects we will simply use the answer
entities assigned by the relevance judges. Entity-based MAP should be
better than the prior year's aspect measure at rewarding a wider range
of correct answers ranked highly, because answer diversity based on
entities should have less overlap between passages for questions with
multiple answers than last year's MeSH categories did.
The documents for this task come from the new full-text
biomedical corpus we have assembled. We have obtained permission from a
number of publishers who use Highwire
Press for electronic distribution
of their journals. They have agreed to allow us to include their
full text in HTML format, which
preserves formatting, structure, table and figure legends, etc. For
more information on the document collection, see the 2006 track data page. As noted in that
file, there are some issues with the document collection:
- The collection is not complete from the standpoint of each entire
journal. That is, there are many journals where articles appeared in
the journal but did not make it into our collection (neither the
article nor the MEDLINE record). This is not an issue for us, since we
view the collection as a closed and fixed collection.
- Some of the PMIDs in the source data from Highwire Press are
inconsistent with PubMed PMIDs (see the paragraph below for explanation).
- Some of the HTML files are empty or nearly empty (i.e., only contain
a small amount of meaningless text). Some of this is due to errors in
our processing, but some is also related to the incorrect PMID problem
of Highwire. We have frozen the collection for now and, since these
files are small, they are unlikely to contain any relevant passages.

As noted in the 2006 protocol, there are some errors between the PMIDs
designated by Highwire and the actual PMIDs from NLM in MEDLINE. We
have identified 1,767 instances (about 1% of the 162K documents) where
the Highwire file PMID is invalid, in the sense that it returns zero
hits when searched on PubMed. Some invalid PMIDs are due to the fact
that the corresponding documents represent errata and author responses
to comments (e.g., author replies to letters). These have been assigned
PMIDs in publisher-supplied data, but NLM generally does not cite them
separately in PubMed and therefore deleted the PMIDs, although they
remained in the publisher data. There are also documents, already
assigned a PMID submitted by Highwire, that NLM decided by policy not
to index at all, in which case, again, NLM deleted the PMID but it was
retained in the Highwire data. Finally, we have found instances of
invalid PMIDs in Highwire data for documents that are cited in PubMed
but with a different PMID that is absent from the Highwire data; such
instances can be characterized as errors. In any case, we have
investigated the problem of invalid PMIDs and found that for every
instance we checked, the problem was the original Highwire file having
an invalid PMID. In other words, the invalid PMIDs are in the Highwire
data and are not a result of our processing. For this reason, we have
decided not to delete these files from the collection. They represent,
in our view, normal dirty data, whether due to errors or to policy
differences between NLM and Highwire.
Since the goal of the task is passage retrieval,
we have developed some
additional data sources that will aid us in managing and evaluating
runs. As noted below, retrieved passages can contain any span of
text that does not include an HTML paragraph tag (i.e., one starting
with <P or
</P). We will also
use these delimiters to extract text
that will be assessed by the relevance judges. Because there has been
much confusion in the discussion about the different types of passages,
we define the following terms:
- Nominated passage
- This is the passage that systems nominate in their runs and will be
scored in the passage retrieval evaluation.
- Maximum-length legal
span - These are all the passages obtained by delimiting the text
of each document by the HTML paragraph tags. As noted below, nominated
passages cannot cross an HTML paragraph boundary. So these spans
represent the longest possible passage that can be designated as
relevant. As also noted below, we will build pools of these spans for
the relevance judges. The judges will be given the entire span, even if
no system nominated the entire span. However, the judges do not need to
designate the entire span as relevant, and may select just a part of the
span to be relevant.
- Relevant passage
- These are the spans that the judges will designate as definitely or
possibly relevant.
In order to
facilitate our management of the data, and perhaps be of use to
participants, we have created a file, legalspans.txt, which
includes all of the maximum-length legal
spans for the collection. This 215-megabyte file is in the
protected area of the Web
site. Note that the first span includes all of the HTML prior to
the first <p>, which will obviously not be part of any relevant
passage. This file identifies all the maximum-length legal spans in all
documents, which consist of all spans >0 bytes delimited by HTML
paragraph tags. These spans are identified by the byte offset and
length in the HTML file. The index number of the first byte of
the file is 0.
We note some other things about the maximum-length legal spans:
- The first and last spans are delimited at the beginning and end of
the file, respectively.
- HTML tags (e.g., <b>) can occur within the spans.
- Zero-length (zero-character) spans are not included.
Let us illustrate these span definitions with an example. The following
data is sample text from an HTML file hypothetically named
12345.html (i.e., having PMID 12345). The numbers above the text
represent the tens (top line) and ones (middle line) digits of each
character's position in the file.

          11111111112222222222333333333344444444445
012345678901234567890123456789012345678901234567890
Aaa. <p> Bbbbb <b>cc</b> ddd. <p><p><p> Eee ff ggg.

The maximum-length legal spans in this example are bytes 0-4, 8-29, and
39-50. Our legalspans.txt file would include the following data:

12345 0 5
12345 8 22
12345 39 12

Let's consider the span 8-29 further. This is a maximum-length legal
span because there is an HTML paragraph tag on either side of it. If a
system nominates a passage that exceeds these boundaries, it will be
disqualified from further analysis or judgment. But anything within the
maximum-length legal span, e.g., 8-19, 18-19, or 18-28, could be a
nominated or relevant passage.
We note that it is possible for there to be more than one relevant
passage in a maximum-length legal span. While this will be unlikely,
our character-based scoring approach (see below) will handle it fine.
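To make the span convention concrete, here is a minimal Python sketch
of how the maximum-length legal spans could be computed from a raw HTML
file. It assumes, as the example above suggests, that any tag beginning
with <P or </P (in either case) acts as a delimiter and that zero-length
spans are dropped; the official legalspans.txt file was produced by our
own tooling, which may differ in details.

import re

# Paragraph delimiters: any tag starting with <P or </P, per the
# protocol wording. We assume case insensitivity; the example above
# uses lowercase <p>.
PARA_TAG = re.compile(rb'</?P[^>]*>', re.IGNORECASE)

def legal_spans(html_bytes):
    """Yield (offset, length) for each maximum-length legal span, i.e.,
    every non-empty stretch of bytes between paragraph tags, plus the
    stretches before the first tag and after the last one. Offsets are
    0-based, as in legalspans.txt."""
    pos = 0
    for match in PARA_TAG.finditer(html_bytes):
        if match.start() > pos:           # zero-length spans are excluded
            yield pos, match.start() - pos
        pos = match.end()
    if pos < len(html_bytes):             # final span, delimited by end of file
        yield pos, len(html_bytes) - pos

if __name__ == '__main__':
    sample = b'Aaa. <p> Bbbbb <b>cc</b> ddd. <p><p><p> Eee ff ggg.'
    for offset, length in legal_spans(sample):
        print('12345', offset, length)    # prints 0 5, then 8 22, then 39 12

Run on the example above, this reproduces the three legalspans.txt
lines shown earlier.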
There are about 36 topics for the track this year, which were
released on June 16th. As noted above, the topics are in the form
of questions asking for lists of specific entities. These entities are
based on controlled terminologies from different sources, with the
source of the terms depending on the entity type. Below
is a list of the entity types and their definitions:
- ANTIBODIES - Immunoglobulin molecules having a specific amino acid
sequence by virtue of which they interact only with the antigen (or a
very similar shape) that induced their synthesis in cells of the
lymphoid series (especially plasma cells).
- BIOLOGICAL SUBSTANCES - Chemical compounds that are produced by a
living organism.
- CELL OR TISSUE TYPES - A distinct morphological or functional form of
cell, or the name of a collection of interconnected cells that perform
a similar function within an organism.
- DISEASES - A definite pathologic process with a characteristic set of
signs and symptoms. It may affect the whole body or any of its parts,
and its etiology, pathology, and prognosis may be known or unknown.
- DRUGS - A pharmaceutical preparation intended for human or veterinary
use.
- GENES - Specific sequences of nucleotides along a molecule of DNA (in
the case of some viruses, RNA) which represent functional units of
heredity.
- MOLECULAR FUNCTIONS - Elemental activities, such as catalysis or
binding, describing the actions of a gene product or bioactive
substance at the molecular level. (Source: GO + Aaron)
- MUTATIONS - Any detectable and heritable change in the genetic
material that causes a change in the genotype and which is transmitted
to daughter cells and to succeeding generations.
- PATHWAYS - A series of biochemical reactions occurring within a cell
to modify a chemical substance or transduce an extracellular signal.
(Source: Wikipedia; metabolic pathway, widened to include signal
transduction)
- PROTEINS - Linear polypeptides that are synthesized on ribosomes and
may be further modified, crosslinked, cleaved, or assembled into
complex proteins with several subunits.
- STRAINS - A genetic subtype or variant of a virus or bacterium.
- SIGNS OR SYMPTOMS - A sensation or subjective change in health
function experienced by a patient, or an objective indication of some
medical fact or quality that is detected by a physician during a
physical examination of a patient.
- TOXICITIES - A measure of the degree and the manner in which
something is toxic or poisonous to a living organism. (Source:
Wikipedia + Aaron)
- TUMOR TYPES - An abnormal growth of tissue, originating from a
specific tissue of origin or cell type, and having defined
characteristic properties, such as a recognized histology.
A group of 14 sample topics has been developed. As can be seen, the
list entity types are incorporated into the questions as capitalized
phrases within square brackets. We have created a training topics file in an Excel spreadsheet
that has one sample topic for each of the entity types along with a
sample passage from a specific document and one or more entities. Here
is a list of the sample topics:
<T1>What [ANTIBODIES] have been used to detect protein TLR4?
<T2>What [BIOLOGICAL SUBSTANCES] have been used to measure
toxicity in response to cytarabine?
<T3>What [CELL OR TISSUE TYPES] express members of the mammalian
TIM gene family?
<T4>What [DISEASES] are associated with lysosomal abnormalities
in the nervous system?
<T5>What [DRUGS] have been tested in mouse models of Alzheimer's
disease?
<T6>What centrosomal [GENES] are implicated in diseases of brain
development?
<T7>What [MOLECULAR FUNCTIONS] does helicase protein NS3 play in
HCV (Hepatitis C virus)?
<T8>What [MUTATIONS] in apolipoprotein genes are associated with
<T9>Which [PATHWAYS] are possibly involved in the disease ADPKD?
<T10>What [PROTEINS] does epsin1 interact with during endocytosis?
<T11>What Streptococcus pneumoniae [STRAINS] are resistant to
penicillin and erythromycin?
<T12>What [SIGNS OR SYMPTOMS] of anxiety disorder are related to
<T13>What [TOXICITIES] are associated with cytarabine?
<T14>What [TUMOR TYPES] are associated with Rb1 mutations?
An example of our topic development process is as follows. Suppose that
the information need is:

What is the genetic component of alcoholism?

We would transform this into a list question of the form:

What genes are genetically linked to
alcoholism? -> [GENE]+
Answers to this question will be passages that relate one or more
entities of type GENE to alcoholism. For example, this would be a valid
and relevant answer to the above question: "The DRD4 VNTR polymorphism
moderates craving after alcohol consumption." (from PMID 11950104, for
those who want to know). The GENE entity supported by this statement
would be DRD4.
The official topics file is in the protected area of the track Web
site. This file was replaced on June 23rd due to some minor
errors. Make sure you
use the revised file, especially since we have changed the topic
numbers so they do not overlap with topic numbers used in previous
years of the track.
Runs should be submitted to the protected area of the NIST Web site by
July 15. The URL is http://ir.nist.gov/trecsubmit/g.html.
Submitted runs can contain up to 1000 passages per topic that are
relevant to answering the topic question. Passages must be
identified by the PMID, the start offset into the text file
in characters, and the length of the passage in characters. Since
we are (or pretend to be) computer scientists, the first byte of each
file will be offset 0.
Passages must be contiguous and no longer than one paragraph. This
will be operationalized by prohibiting any passage from containing HTML
paragraph tags, i.e., those starting with
<P or </P. Any passage with such
tags will be ignored in the judgment process but not omitted from the
scoring process. (In other words, it will not count as relevant but
will count as retrieved.) Each participating group will be
allowed to submit up to three official runs, one of which must be
designated as a precedence run to be used for building pools. (Of
course, once the
relevance judgments are released, groups will be able to score any
additional runs they do.) Each passage will also need to be
assigned a corresponding rank number
and value, which will be used
to order nominated passages for rank-based performance
computations. Rank values can be
floating point numbers such as confidence values.
Each run will need to be submitted in a separate file, with each line
defining one nominated passage using the following format based
loosely on trec_eval. Each line in the file must contain the
following data elements, separated by white space (please only spaces
or a tab character):
- Topic ID -
the topic number, from 200 to 235.
- Doc ID -
the name of the HTML file minus the .html
extension. This is the PMID that has been designated by Highwire,
even though we now know that this may not be the true PMID assigned by
the NLM (i.e., used in MEDLINE). But
this is the official identifier for the document in this collection.
- Rank number -
rank of the passage for the topic, starting with 1 for the top-ranked
passage and proceeding down to as many as 1000.
- Rank value -
system-assigned score for the rank of the passage, an internal number
that should decrease in value as the rank number increases.
- Passage start -
the byte offset in the Doc ID file where the passage begins, where the
first character of the file is offset 0.
- Passage length -
the length of the passage in bytes, in 8-bit ASCII, not Unicode.
- Run tag -
a name assigned by the submitting group that should be distinct from all the
group's other runs (and ideally any other group's runs, so it should
probably have the group name, e.g., OHSUbaseline).
Here is an example of what the file might look like:

200 12474524 1 1.0 1572 27 tag1
200 12513833 2 0.373 1698 54 tag1
200 12517948 3 0.222 99 159 tag1
201 12531694 1 0.907 232 38 tag1
201 12545156 2 0.456 789 201 tag1

Runs will be submitted using a form in the active participants area
of the NIST TREC Web site. A Perl script (check_genomics.pl) that
checks your run to ensure it is in the proper format will be
distributed soon.
Your run should include a "dummy" passage for any topic for which
you did not retrieve any passages. That dummy passage should use "0" as
a docid, "0" as the passage start, and "1" as the passage length. This
will work for check_genomics.pl and does not correspond to a document
in the collection.
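For illustration, here is a minimal Python sketch of emitting a run
file in this format, including the dummy-passage convention. The
function name, output file name, and input structure are hypothetical;
check_genomics.pl remains the authoritative format checker.

def write_run(path, run_tag, results_by_topic):
    """results_by_topic maps topic_id -> list of (pmid, score, start,
    length), already sorted by descending score, at most 1000 per topic."""
    with open(path, 'w') as out:
        for topic_id in sorted(results_by_topic):
            passages = results_by_topic[topic_id][:1000]
            if not passages:
                # Dummy passage for a topic with no retrieved passages:
                # docid 0, passage start 0, passage length 1.
                out.write(f'{topic_id} 0 1 0.0 0 1 {run_tag}\n')
                continue
            for rank, (pmid, score, start, length) in enumerate(passages, 1):
                out.write(f'{topic_id} {pmid} {rank} {score} '
                          f'{start} {length} {run_tag}\n')

write_run('OHSUbaseline.txt', 'OHSUbaseline',
          {200: [(12474524, 1.0, 1572, 27), (12513833, 0.373, 1698, 54)],
           201: []})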
Runs will also need to be classified based on the amount of human
intervention in converting topics to queries. We will adopt the "usual" TREC rules
(detailed at http://trec.nist.gov/act_part/guidelines/trec8_guides.html)
for categorizing runs:
- Automatic - no
human modification of topics into queries
- Manual - human
modification of queries entered into your system
(or any other system) but no modification based on results obtained
(i.e., you cannot look at
the output from your runs to modify the queries)
- Interactive -
human interaction with the system, including
modification of the queries or the system after viewing the output from
it or any other system (i.e., you look at
the output from the topics and corpus and adjust your system to produce
better results)
The expert judging for this evaluation will use the pooling method,
with passages corresponding to the same topic ID pooled
together. The judges will be presented with the text of the
maximum-length legal span containing each pooled passage. They then
evaluate the text of the maximum-length legal span for relevance, and
identify the portion of this text that contains an answer. This could
be all of the text of the maximum legal span, or any contiguous
substring. It is possible that one maximum legal span could result in
two separate gold standard passages, but this will likely be uncommon.
Our evaluation methodology can handle it, if the judges deem it
necessary. Full instructions for
the judges are available.
Assessing System Performance
For this year’s track, there will again be three levels of retrieval
measured: passage retrieval, aspect retrieval, and
document retrieval. Each of these provides insight into the
performance for a user trying to answer the given topic questions. Each
will be measured by some variant of mean average precision (MAP). We
will again measure the three types of performance separately; there
will not be any summary metric to grade overall
performance. A Python program to calculate these measures
with the appropriate gold standard data files is available.
Passage-level retrieval performance - character-based MAP
The original passage retrieval measure for the 2006 track was found to
be problematic in that non-content manipulations of passages had
substantial effects on passage MAP, with one group claiming that
splitting passages in half with no other changes doubled their
(otherwise low) score. To address this,
we defined an alternative passage
MAP (PASSAGE2) that calculated MAP as if
each character in each passage were a ranked document. In essence, the
ranked output of passages was concatenated, with each character labeled
as coming from a relevant passage or not. We will use PASSAGE2 as the
primary passage retrieval
evaluation measure in 2007.
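As an informal illustration of the PASSAGE2 idea, the sketch below
computes average precision for a single topic by walking the nominated
passages in rank order and treating each retrieved character as a
ranked "mini-document." We assume here that a gold-standard character
is credited only the first time it is retrieved; the official Python
scoring program is the definitive implementation.

def passage2_ap(ranked_passages, gold_chars):
    """Character-based average precision for one topic.
    ranked_passages: list of (pmid, start, length) in rank order.
    gold_chars: dict mapping pmid -> set of relevant character offsets."""
    total_gold = sum(len(chars) for chars in gold_chars.values())
    seen = {pmid: set() for pmid in gold_chars}
    retrieved = 0       # characters retrieved so far, in rank order
    hits = 0            # distinct gold characters retrieved so far
    precision_sum = 0.0
    for pmid, start, length in ranked_passages:
        relevant = gold_chars.get(pmid, set())
        for offset in range(start, start + length):
            retrieved += 1
            if offset in relevant and offset not in seen[pmid]:
                seen[pmid].add(offset)
                hits += 1
                precision_sum += hits / retrieved
    return precision_sum / total_gold if total_gold else 0.0

Averaging this value over all topics gives the character-based MAP.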
The original measure will also be calculated. This measure computes
precision scores for passages based on character-level precision, using
a variant of the approach used for the TREC 2004 HARD Track. For each
nominated passage, some fraction of its characters will overlap with those
deemed relevant by the judges in the gold standard. At each relevant
passage, precision will be computed as the number of characters
overlapping with the gold standard passages divided by the total number
of characters included in all nominated passages from this system for
the topic up until
that point. Similar to regular MAP, relevant passages that are not
retrieved will be added into the calculation as well, with precision
set to 0 for relevant passages not retrieved. Then the mean of
these average precisions over all topics
will be calculated to compute the mean average passage precision.
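As a rough sketch of this original measure for a single topic, the
code below treats each nominated passage that overlaps the gold
standard as a point at which precision is computed; whether precision
points are anchored at nominated or at gold passages is a detail we are
approximating, and the official scoring program governs.

def passage_ap_2006(ranked_passages, gold_chars, num_gold_passages):
    """Approximate original passage measure for one topic.
    ranked_passages: list of (pmid, start, length) in rank order.
    gold_chars: dict mapping pmid -> set of relevant character offsets.
    num_gold_passages: number of gold-standard passages for the topic."""
    chars_retrieved = 0
    chars_overlapping = 0
    precision_sum = 0.0
    for pmid, start, length in ranked_passages:
        chars_retrieved += length
        overlap = len(gold_chars.get(pmid, set())
                      .intersection(range(start, start + length)))
        if overlap:
            chars_overlapping += overlap
            precision_sum += chars_overlapping / chars_retrieved
    # Unretrieved relevant passages contribute precision 0, so we
    # simply divide by the total number of gold passages.
    return precision_sum / num_gold_passages if num_gold_passages else 0.0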
Aspect-level retrieval performance - aspect-based MAP
Aspect retrieval will be measured using the average precision for the
aspects of a topic, averaged across all topics. To compute this, for each
submitted run, the ranked passages will be transformed into one of two
types of items:
- the aspect(s) of the gold standard passage that the submitted
passage overlaps with, or
- not-relevant.
This will result in an ordered list, for each run and each topic, of
aspects and not-relevant items. Because we are uncertain of the utility
for a user of a repeated aspect (e.g., the same aspect occurring again
further down the list), we will discard repeats from the output to be
analyzed. For the remaining aspects of a topic, we will calculate
MAP similarly to how it is calculated for documents.
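The following sketch illustrates this aspect-level computation for a
single topic, assuming each nominated passage has already been mapped
to the set of gold-standard aspects it overlaps (an empty set meaning
not-relevant). How several new aspects arriving at one rank are
credited is our assumption, not a track specification.

def aspect_ap(ranked_aspect_sets, num_gold_aspects):
    """Average precision over aspects for one topic.
    ranked_aspect_sets: for each nominated passage in rank order, the
    set of gold aspects it overlaps (empty set = not-relevant).
    num_gold_aspects: number of distinct gold aspects for the topic."""
    seen = set()
    precision_sum = 0.0
    for rank, aspects in enumerate(ranked_aspect_sets, 1):
        new = aspects - seen              # repeated aspects are discarded
        if new:
            seen |= new
            precision_sum += len(seen) / rank
    return precision_sum / num_gold_aspects if num_gold_aspects else 0.0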
Document-level retrieval performance - document-based MAP
For the purposes of this measure, any PMID that has a passage
associated with a topic ID in the set of gold standard passages will be
considered a positive document for that topic. All other documents are
considered negative for that topic. System run outputs will be
similarly collapsed, with the documents appearing in the same order as
the first time the corresponding PMID appears in the nominated passages
for that topic.
For a given system run, average precision will be measured at each
point of correct (relevant) recall for a topic. The MAP will be the
mean of the average precisions across all topics.
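Here is a minimal sketch of this document-level collapse and average
precision computation for one topic; the function and input names are
illustrative only.

def document_ap(ranked_passages, relevant_pmids):
    """Average precision over documents for one topic.
    ranked_passages: list of (pmid, start, length) in rank order.
    relevant_pmids: set of PMIDs having at least one gold passage."""
    seen = set()
    hits = 0
    precision_sum = 0.0
    for pmid, _start, _length in ranked_passages:
        if pmid in seen:
            continue                      # keep first occurrence only
        seen.add(pmid)
        if pmid in relevant_pmids:
            hits += 1
            precision_sum += hits / len(seen)
    return precision_sum / len(relevant_pmids) if relevant_pmids else 0.0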
As with many TREC tasks, groups will be able to manually modify topics
to create the queries for their systems. In addition, they will
be able to consult outside resources on the Web (e.g., gene databases)
but only in a fully
automated fashion. In other words, the original queries may be
manually modified, but interaction with external resources can only be done in an
automated fashion. For example, if your system goes and pulls
information from SOURCE, GenBank, or any other
resource, the query to
those sources and the information obtained from them must be done in an
automated way, i.e., without manual intervention.
Those who do modify queries manually must describe their runs as manual
or interactive, depending on whether they inspect system output (in
which case the run should be categorized as interactive) or not (in
which case the run should be categorized as manual).