TREC 2006 Genomics Track Protocol

Last updated - November 11, 2006

What's New

The final overview paper and slide presentation from the plenary session from TREC 2006 are available. The relevance judgments and program for calculating results are available in the protected area of the track Web site and are described on the 2006 track data and tools page. The track received 92 runs from 30 groups this year. Ten relevance judges carried out their work based on the relevance judgment guidelines.

Motivations

The goal of most information retrieval (IR) systems is to retrieve documents that a user might find relevant to his or her information need. The goal of most information extraction (IE) or text mining (TM) systems is to process document text to provide the user with one or more "answers" to a question or information need. We would argue that what most information seekers, especially users of the biomedical literature, desire is something in the middle, i.e., a system that attempts to answer questions but puts the answers in context, providing supporting information and linking to original sources.

For the TREC 2006 Genomics Track, we have developed a new single task that focuses on retrieval of passages (ranging from part of a sentence to a full paragraph in length) with linkage to the source document. Topics are expressed as questions, and systems will be measured on how well they retrieve relevant information at the passage, aspect, and document level. Systems must return passages linked to source documents, while relevance judges will need not only to rate the passages, but also to group them by aspect. (Aspect here is defined similarly to how it was defined in the TREC Interactive Track aspectual recall task, representing all answers that are similar. An example is below.)

We do note this task is somewhat experimental. However, we believe it represents an important functionality for the biomedical user of literature retrieval systems. We are willing to concede that the first year of this task may not go perfectly, and a second year might be required to "get it right." Fortunately, the National Science Foundation (NSF) grant funding the Genomics Track provides us that luxury.

Documents

The documents for this task come from the new full-text biomedical corpus we have assembled. We have obtained permission from a number of publishers who use Highwire Press for electronic distribution of their journals. They have agreed to allow us to include their full text in HTML format, which preserves formatting, structure, table and figure legends, etc. For more information on the document collection, see the 2006 track data page. As noted in that file, there are some issues with the document collection:
Since the goal of the task is passage retrieval, we have developed some additional data sources that will aid us in managing and evaluating runs. As noted below, retrieved passages can contain any span of text that does not include an HTML paragraph tag (i.e., one starting with <P or </P). We will also use these delimiters to extract text that will be assessed by the relevance judges. Because there has been much confusion in the discussion about the different types of passages, we define the following terms:
We note some other things about the maximum-length legal spans:
In order to facilitate our management of the data, and perhaps be of use to participants, we have created a file, legalspans.txt, which identifies all of the maximum-length legal spans in the collection, i.e., all spans >0 bytes delimited by HTML paragraph tags. This 215-megabyte file is in the protected area of the Web site. Each span is identified by the PMID, the byte offset into the HTML file, and its length in bytes; the first byte of each file has offset 0. Note that the first span of a document includes all of the HTML prior to the first <p>, which will obviously not be part of any relevant passage.

Let us illustrate these span definitions with an example. The last line of the following data is sample text from an HTML file hypothetically named 12345.html (i.e., having PMID 12345). The numbers above it represent the tens (top line) and ones (middle line) digits of the file position in bytes.
000000000011111111112222222222333333333344444444445
012345678901234567890123456789012345678901234567890
Aaa. <p> Bbbbb <b>cc</b> ddd. <p><p><p> Eee ff ggg.
The maximum-length legal spans in this example are from bytes 0-4, 8-29, and 39-50. Our legalspans.txt file would include the following data:
12345 0  5
12345 8 22
12345 39 12
Let's consider the span 8-29 further. This is a maximum-length legal span because there is an HTML paragraph tag on either side of it. If a system nominates a passage that exceeds these boundaries, it will be disqualified from further analysis or judgment. But any span within the maximum-length legal span, e.g., 8-19, 18-19, or 18-28, could be a nominated or relevant passage.

We note that it is possible for there to be more than one relevant passage in a maximum-length legal span. While this is unlikely, our character-based scoring approach (see below) handles it fine.
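To make the span bookkeeping concrete, here is a minimal sketch (not part of the official track tools) that reads the three-column legalspans.txt format shown above, checks whether a nominated passage stays within a single maximum-length legal span, and extracts a span's text from a document assumed to be stored as <PMID>.html. The helper names and file layout are our own assumptions.

# Minimal sketch, assuming the legalspans.txt format shown above
# (PMID, byte offset, length) and documents stored as <PMID>.html.
from collections import defaultdict

def load_legal_spans(path="legalspans.txt"):
    spans = defaultdict(list)                      # PMID -> list of (offset, length)
    with open(path) as f:
        for line in f:
            pmid, offset, length = line.split()
            spans[pmid].append((int(offset), int(length)))
    return spans

def is_legal_passage(spans, pmid, start, length):
    # A passage is legal if it lies entirely within one maximum-length legal
    # span, i.e., it does not cross an HTML paragraph tag.
    return any(off <= start and start + length <= off + span_len
               for off, span_len in spans.get(pmid, []))

def span_text(pmid, start, length, doc_dir="."):
    with open(f"{doc_dir}/{pmid}.html", "rb") as f:   # offsets count from byte 0
        f.seek(start)
        return f.read(length)

# With the hypothetical 12345.html above, the span starting at byte 18 with
# length 10 lies inside the maximum-length legal span 8-29, so it is legal:
# is_legal_passage(load_legal_spans(), "12345", 18, 10)  -> True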

Topics

The topics for the 2006 track are expressed as questions. They were derived from the set of biologically relevant questions based on the Generic Topic Types (GTTs) developed last year for the 2005 track. These questions each have one or more aspects that are contained in the literature corpus (i.e., one or more answers to each question). A few things should be noted about the topics for 2006:
There have been some repeated questions about two topics for 2006, numbers 164 and 180. Let us answer them here:
We note that the questions (and GTTs) all have the general format of containing one or more biological objects and processes and some explicit relationship between them: 
	Biological object (1..many) <--relationship--> Biological process (1..many) 
The biological object might be genes, proteins, gene mutations, etc. The biological process can be physiological processes or diseases. The relationships can be anything, but are typically verbs such as causes, associated with, or phosphorylates. We determined that four of the five GTTs from 2005 could be reformulated into the above structure, the exception being the first GTT, which asks about procedures or methods. The patterns for doing this from the GTTs were based on the examples in the following table:
 
GTT | Question Pattern | Example
Find articles describing the role of a gene involved in a given disease. | What is the role of gene in disease? | What is the role of DRD4 in alcoholism?
Find articles describing the role of a gene in a specific biological process. | What effect does gene have on biological process? | What effect does the insulin receptor gene have on tumorigenesis?
Find articles describing interactions (e.g., promote, suppress, inhibit, etc.) between two or more genes in the function of an organ or in a disease. | How do genes interact in organ function? | How do HMG and HMGB1 interact in hepatitis?
Find articles describing one or more mutations of a given gene and its biological impact. | How does a mutation in gene influence biological process? | How does a mutation in Ret influence thyroid function?

The "data structure" of "answers" to the topic questions will consist of passages, each linked to both a document, specified by the PubMed ID (PMID), and an aspect. Some passages may be linked to more than one aspect. Some passages may overlap and/or belong to multiple aspects. Ultimately, the relevance judges (our primary resource constraint, but we do have resources to recruit and remunerate them) will determine the "best" passages and group them into aspects. The judges will also assign one or more Medical Subject Headings (MeSH) terms (possibly with subheadings) to each aspect. (Earlier versions of this protocol suggested using Gene Ontology (GO) terms for grouping aspects, but our anlaysis determined that MeSH terms work better for this.)

Submissions

Submitted runs can contain up to 1000 passages per topic that are predicted to be relevant to answering the topic question. Passages must be identified by the PMID, the start offset into the text file in characters, and the length of the passage in characters. Since we are (or pretend to be) computer scientists, the first byte of each file will be offset 0.

Passages must be contiguous and not longer than one paragraph. This will be operationalized by prohibiting any passage from containing HTML paragraph tags, i.e., those starting with <P or </P. Any passage with these tags will be ignored in the judgment process but not omitted from the scoring process (in other words, it will not count as relevant but will count as retrieved). Each participating group will be allowed to submit up to three official runs, one of which must be designated as a precedence run to be used for building pools. (Of course, once the relevance judgments are released, groups will be able to score any additional runs they do.) Each passage will also need to be assigned a corresponding rank number and rank value, which will be used to order nominated passages for rank-based performance computations. Rank values can be integers or floating point numbers, such as confidence values.

Each submitted run will need to be submitted in a separate file, with each line defining one nominated passage using the following format, based loosely on trec_eval. Each line in the file must contain the following data elements, separated by white space (please use only spaces or a tab character): the topic number, the PMID of the source document, the rank of the passage, the rank value, the passage start offset in characters, the passage length in characters, and the run tag.
Here is an example of what the file might look like:
160 12474524 1 1.0   1572 27  tag1
160 12513833 2 0.373 1698 54 tag1
160 12517948 3 0.222 99 159 tag1
161 12531694 1 0.907 232 38 tag1
161 12545156 2 0.456 789 201 tag1
Runs were submitted using the form in the active participants area of the NIST TREC Web site: http://ir.nist.gov/trecsubmit/g.html. The following link has a Perl script that checks your run to ensure it is in the proper format: http://trec.nist.gov/act_part/scripts/06.scripts/check_genomics.pl. It does not check whether your spans are "legal," as defined above.

Your run should have included a "dummy" passage for any topic for which you did not retrieve any passages. That dummy passage should use "0" as a docid, "0" as the passage start, and "1" as the passage length. This will work for check_genomics.pl and does not correspond to a document in the collection.
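As an illustration only, under the column order shown in the example above (topic, PMID, rank, rank value, passage start, passage length, run tag), a run file could be written along these lines. The rank and rank value chosen for the dummy passage are our own assumption, since the protocol only fixes its docid, start, and length.

# Minimal sketch for writing a run file in the format shown above.
def write_run(results_by_topic, run_tag, out_path):
    # results_by_topic: {topic: [(pmid, rank_value, start, length), ...]},
    # each list already sorted from best to worst.
    with open(out_path, "w") as out:
        for topic in sorted(results_by_topic):
            passages = results_by_topic[topic][:1000]       # at most 1000 per topic
            if not passages:
                # Dummy passage for a topic with no retrieval: docid 0, start 0,
                # length 1 (rank 1 and rank value 0.0 are assumptions).
                out.write(f"{topic} 0 1 0.0 0 1 {run_tag}\n")
                continue
            for rank, (pmid, value, start, length) in enumerate(passages, start=1):
                out.write(f"{topic} {pmid} {rank} {value} {start} {length} {run_tag}\n")

# write_run({160: [(12474524, 1.0, 1572, 27)], 161: []}, "tag1", "myrun.txt")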

Because of the complex nature of this year's task, and the fact that most groups did not have a system in place before the release of the topics, the classification of runs will be complicated this year. Here is a summary of the "usual" TREC rules (detailed at http://trec.nist.gov/act_part/guidelines/trec8_guides.html) for categorizing runs: However, because we are reusing topics and because people are building systems up to the last minute, the following rules apply to how you should classify your runs:

Relevance Judgments

The expert judging for this evaluation will use the pooling method, with passages corresponding to the same topic ID pooled together. The judges will be presented with the text of the maximum-length legal span containing each pooled passage. They will then evaluate the text of the maximum-length legal span for relevance and identify the portion of this text that contains an answer. This could be all of the text of the maximum-length legal span, or any contiguous substring. It is possible that one maximum-length legal span could result in two separate gold standard passages, but this will likely be uncommon; our evaluation methodology can handle it if the judges deem it necessary. The full instructions for judges are available.

Passages will be pooled using the following procedure:
  1. For a given topic, all maximum-length legal spans containing nominated passages will be identified.
  2. Based on the conventional TREC pooling approach, we will take the top X passages from each run, where X is a number set to give a desired number of passages based on the availability of assessing resources.
  3. Steps 1 and 2 will be repeated for all topics. This will result in a set of at most 1000 paragraphs of text per topic for the expert judges to evaluate. (A sketch of steps 1-3 appears after this list.)
  4. The expert judges will be given the list of topics and corresponding paragraphs. For each topic, and each paragraph within a topic, the judges will mark all or part of the paragraph as relevant to answering the topic question. Judges will assign MeSH term-based aspects to each contiguous span of relevant text. Judges will be instructed to use the most specific MeSH term, similar to the NLM literature indexing process. If the best available term is not specific enough to denote an aspect of the question, judges may assign two MeSH terms to a relevant passage and/or use a MeSH term in combination with a subheading.
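The following sketch illustrates steps 1-3 under our reading of the procedure; load_legal_spans() is the hypothetical helper from the earlier span example, and top_x stands for the per-run cutoff X.

# Minimal sketch of steps 1-3: take the top X passages from each run for a
# topic and pool the maximum-length legal spans that contain them.
def containing_span(spans, pmid, start, length):
    # spans: the PMID -> [(offset, length)] mapping from load_legal_spans().
    for off, span_len in spans.get(pmid, []):
        if off <= start and start + length <= off + span_len:
            return (pmid, off, span_len)
    return None                                   # illegal passage; not judged

def pool_topic(runs, topic, spans, top_x):
    # runs: list of {topic: [(pmid, start, length), ...]} in rank order.
    pooled = set()
    for run in runs:
        for pmid, start, length in run.get(topic, [])[:top_x]:
            span = containing_span(spans, pmid, start, length)
            if span is not None:
                pooled.add(span)                  # each legal span is judged once
    return pooled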
Sample passages and aspects for two topics are posted in both HTML and as a spreadsheet. It should be remembered that these should not be considered training data, but rather are examples of what passages and aspects will look like in the final data. It should also be noted that the PMIDs in these samples are not in the actual document collection.

The judging process will result in a set of gold standard passages. These will be designated in a file containing the following data elements, separated by a tab character: the topic number, the PMID, the passage start offset, the passage length, and the MeSH term(s) denoting the aspect(s) of the passage, separated by semicolons.
Here is an example of what the file might look like:
160 12474524 1572 27  MeSH1;MeSH2
160 12513833 1698 54 MeSH1;MeSH3
160 12517948 99 159 MeSH4
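A gold standard file in this format can be read with a few lines of code. This is a sketch only; the field order (topic, PMID, start, length, semicolon-separated MeSH aspects) is taken from the example above.

# Minimal sketch for parsing the gold standard passage file shown above.
def load_gold(path):
    gold = {}                                       # topic -> list of gold passages
    with open(path) as f:
        for line in f:
            topic, pmid, start, length, aspects = line.split(None, 4)
            gold.setdefault(topic, []).append({
                "pmid": pmid,
                "start": int(start),
                "length": int(length),
                "aspects": aspects.strip().split(";"),   # e.g. ["MeSH1", "MeSH2"]
            })
    return gold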

Measuring System Performance

For this year’s track, there are three levels of retrieval performance that we will measure: passage retrieval, aspect retrieval, and document retrieval. Each of these provides insight into the overall performance for a user trying to answer the given topic questions. Each will be measured by some variant of mean average precision (MAP).

Because this is a new task, and uncharted research territory, we will measure the three types of performance separately. We are not proposing any summary metric to grade overall performance, but instead wish to examine each aspect of performance in a way that is as meaningful and straightforward as we can make it at our current level of experience with this task. It is clear that future work will refine and modify these measures. One of our goals is to collect sufficient experience this year to enable future refinements.

Passage-level retrieval performance - character-based MAP

This measure will use a variation of MAP, computing individual precision scores for passages based on character-level precision, using an approach similar to that used for the TREC 2004 HARD Track. For each nominated passage, some fraction of its characters will overlap with those deemed relevant by the judges in the gold standard. At each relevant retrieved passage, precision will be computed as the number of characters overlapping with the gold standard passages divided by the total number of characters included in all nominated passages from this system for the topic up to that point. As with regular MAP, relevant passages that were not retrieved will be added into the calculation as well, with precision set to 0. The mean of these average precisions over all topics will then be calculated to give the mean average passage precision (MAPP).
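In symbols (our own formalization of the description above, not official track notation), let n_i and r_i be the total and relevant character counts of the i-th nominated passage for a topic, and let R be the total number of gold standard relevant passages for that topic; relevant passages that are never retrieved contribute a precision of 0 and so appear only in R.

\[
\mathrm{Prec}(j) = \frac{\sum_{i \le j} r_i}{\sum_{i \le j} n_i},
\qquad
\mathrm{APP} = \frac{1}{R} \sum_{j:\, r_j > 0} \mathrm{Prec}(j),
\qquad
\mathrm{MAPP} = \frac{1}{|T|} \sum_{t \in T} \mathrm{APP}_t,
\]

where T is the set of topics. On the worked example below this gives (12/18 + 30/57 + 0)/3, approximately 0.398.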

Aspect-level retrieval performance - aspect-based MAP

Aspect retrieval will be measured using the average precision for the aspects of a topic, averaged across all topics. To compute this, for each submitted run, each ranked passage will be transformed to one of two types of values: the aspect(s) of the gold standard passage(s) it overlaps, or not relevant if it overlaps none. This will result in an ordered list, for each run and each topic, of aspects and not-relevant entries. Because we are uncertain of the utility for a user of a repeated aspect (e.g., the same aspect occurring again further down the list), we will discard repeats from the output to be analyzed. For the remaining aspects of a topic, we will calculate MAP similarly to how it is calculated for documents.

Document-level retrieval performance - document-based MAP

For the purposes of this measure, any PMID that has a passage associated with a topic ID in the set of gold standard passages will be considered a positive document for that topic. All other documents are considered negative for that topic. System run outputs will be similarly collapsed, with the documents appearing in the same order as the first time the corresponding PMID appears in the nominated passages for that topic.

For a given system run, average precision will be measured at each point of correct (relevant) recall for a topic. The MAP will be the mean of the average precisions across topics.

Sample Data and Performance Calculations

For the calculation of passage measures, consider the following hypothetical data for a topic in a run.

For this topic, there are five nominated passages:
A bb ccc ddd ee ff
G hh iii jjj kk ll mm
N oo pp q ss ttt u
V w
Xxx yy zzz

The relevance judge determined that portions of the nominated passages are relevant for this topic: 12 of the 18 characters of the first passage (A bb ccc ddd ee ff) and all 18 characters of the third passage (N oo pp q ss ttt u). The second, fourth, and fifth passages contain no relevant text.

There is also one other relevant passage not retrieved for this topic:
B bbb yy o

Statistically, this topic in this run has: five nominated passages totaling 70 characters, of which 30 are relevant; and a gold standard containing 40 relevant characters in total, 10 of them in the unretrieved passage B bbb yy o.
The table below shows each character from each retrieved passage and its position (1-21) within the passage:

Position:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
Passage 1:   A     b  b     c  c  c     d  d  d     e  e     f  f
Passage 2:   G     h  h     i  i  i     j  j  j     k  k     l  l     m  m
Passage 3:   N     o  o     p  p     q     s  s     t  t  t     u
Passage 4:   V     w
Passage 5:   X  x  x     y  y     z  z  z
For the calculation of aspect and document measures, consider the following different hypothetical data for a topic in a run. Note that the middle column contains the passages nominated for the run. Each passage is linked to a document (designated by its PMID). Each passage either has a corresponding aspect or it is not relevant. If there is any relevant passage/aspect in a document, then the document is deemed relevant.

The passages were nominated by the run and were judged for relevance and assigned to aspects by a judge. The passages are also linked to the documents from which they came. Let's also assume there are another 3 relevant passages, corresponding to 3 additional aspects, that were not retrieved. This means that for this topic, there are: 9 nominated passages (7 relevant and 2 not relevant), and 5 distinct aspects retrieved (A1-A5) plus the 3 relevant aspects not retrieved, for 8 relevant aspects in all.
Document | Passage | Aspect
D1 | P1 | A1
D2 | P2 | A2
D3 | P3 | NR
D1 | P4 | A3
D4 | P5 | A4
D1 | P6 | A1
D5 | P7 | NR
D2 | P8 | A2
D6 | P9 | A5

Passages

Our primary performance measure will be character-based mean average passage precision (MAPP). How will this be calculated?

At the end of each nominated passage, we will calculate cumulative character-based recall (CCR) and cumulative character-based precision (CCP). CCR is the number of relevant characters that have been nominated by the end of a passage divided by the total number of relevant characters in all gold standard passages for the topic. CCP is the number of relevant characters that have been nominated by the end of a passage divided by the total number of characters that have been nominated. This is demonstrated in the following table.
 
Passage | Chars in passage | Relevant chars in passage | CCR | CCP
A bb ccc ddd ee ff | 18 | 12 | 12/40 = 0.3 | 12/18 = 0.67
G hh iii jjj kk ll mm | 21 | 0 | 12/40 = 0.3 | 12/39 = 0.31
N oo pp q ss ttt u | 18 | 18 | 30/40 = 0.75 | 30/57 = 0.53
V w | 3 | 0 | 30/40 = 0.75 | 30/60 = 0.5
Xxx yy zzz | 10 | 0 | 30/40 = 0.75 | 30/70 = 0.43

There is also one additional relevant passage, as noted above (B bbb yy o), that has not been retrieved. Per the usual approach to MAP, this passage contributes a CCP of 0. In this example, the average passage precision (APP) is the average of the CCP at all relevant passages, including the one not retrieved, which is 0.39. MAPP is then the mean of these APPs over all topics in the collection.
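A sketch of this calculation on the sample data (character counts taken from the table above; the helper is illustrative, not the official scorer):

# Character-based average passage precision (APP) for one topic. "nominated"
# holds (characters in passage, relevant characters) in rank order.
def average_passage_precision(nominated, unretrieved_relevant):
    nominated_chars = 0
    relevant_chars = 0
    precisions = []
    for chars, relevant in nominated:
        nominated_chars += chars
        relevant_chars += relevant
        if relevant > 0:                               # CCP at a relevant retrieved passage
            precisions.append(relevant_chars / nominated_chars)
    precisions.extend([0.0] * unretrieved_relevant)    # relevant passages never retrieved
    return sum(precisions) / len(precisions) if precisions else 0.0

sample = [(18, 12), (21, 0), (18, 18), (3, 0), (10, 0)]
print(average_passage_precision(sample, unretrieved_relevant=1))   # ~0.398 (the 0.39 above)
# MAPP is the mean of these per-topic values.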

Aspects

Our primary performance measure will be mean average aspect precision (MAAP). How will we calculate this?

As noted in the protocol, we will measure the highest rank of a passage that contains all or part of a particular aspect. There is no minimum on how much of the passage must be present, since we do not really know what a significant cut-off should be; if any character of a passage for an aspect is present, the aspect is considered retrieved. If an aspect has previously appeared in the list, it will be omitted (based on our decision this year that we do not know what we really want to do with repeated aspects). Thus the list from the sample data above will collapse to the following:
 
Rank | Aspect
1 | A1
2 | A2
3 | NR
4 | A3
5 | A4
6 | NR
7 | A5
 
We will then calculate the precision at each retrieved relevant aspect. These will be averaged within a topic, and the per-topic averages will be averaged across topics to obtain the MAAP.
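A sketch of the aspect calculation for the collapsed list above. Whether aspects that are never retrieved (the 3 additional ones assumed earlier) count in the denominator is our reading of "MAP similar to how it is calculated for documents," so treat the exact number as illustrative.

# Aspect-level average precision for one topic. ranked_aspects is the collapsed
# list above ("NR" = not relevant, duplicate aspects already removed);
# total_relevant_aspects includes aspects that were never retrieved.
def aspect_average_precision(ranked_aspects, total_relevant_aspects):
    retrieved = 0
    precision_sum = 0.0
    for rank, aspect in enumerate(ranked_aspects, start=1):
        if aspect != "NR":
            retrieved += 1
            precision_sum += retrieved / rank    # precision at this newly retrieved aspect
    return precision_sum / total_relevant_aspects

ranked = ["A1", "A2", "NR", "A3", "A4", "NR", "A5"]
print(aspect_average_precision(ranked, total_relevant_aspects=8))   # ~0.53 under these assumptions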

Documents

Our primary performance measure is the standard mean average precision (MAP). From the standpoint of retrieval, documents will appear in the list only once, so a document already retrieved via an earlier passage will be omitted when it appears again. As such, the list of retrieved documents above will collapse as follows.
 
Rank | Document
1 | D1
2 | D2
3 | D3
4 | D4
5 | D5
6 | D6
7 | D7
 
As noted above, D1, D2, D4, D6, and D7-D9 are relevant. MAP will be calculated by trec_eval in the usual way.
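A sketch of the document calculation on this sample, using the relevant set stated above (D1, D2, D4, D6, and D7-D9); this mirrors what trec_eval computes for a single topic.

# Document-level average precision for one topic: precision at each relevant
# retrieved document, divided by the total number of relevant documents
# (unretrieved relevant documents contribute 0).
def document_average_precision(ranked_docs, relevant_docs):
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_docs)

ranked = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]
relevant = {"D1", "D2", "D4", "D6", "D7", "D8", "D9"}
print(document_average_precision(ranked, relevant))   # ~0.59 for this example
# MAP is the mean of these per-topic averages.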

Other Rules

As with many TREC tasks, groups will be able to manually modify topics to create the queries for their systems. In addition, they will be able to consult outside resources on the Web (e.g., gene databases), but only in a fully automated fashion. In other words, the original queries may be manually modified, but interaction with external resources can only be done in an automated fashion. For example, if your system pulls information from SOURCE, GenBank, or any other resource, the query to those sources and the information obtained from them must be handled in an automated way, i.e., without manual intervention.

Those who do modify queries manually must describe their runs as manual or interactive, depending on whether they inspect system output (in which case the run should be categorized as interactive) or not (in which case the run should be categorized as manual).

Timeline

Here is the working timeline for the 2006 track: