TREC 2007 Genomics Track Protocol

Last updated: November 30, 2008

The relevance judgments are posted in the protected area of the track Web site. An overview of the results is available.

Introduction

For the TREC 2007 Genomics Track, we are undertaking a modification of the question answering extraction task used in 2006. We will continue to task systems with extracting relevant passages of text that answer topic questions. This year, however, instead of categorizing questions by generic topic type (GTT), we will derive questions from biologists' information needs whose answers are, in part, lists of named entities of a given type. Systems will still be required to return passages of text, which will provide one or more relevant list items in context.

We have gathered new information needs from working biologists. This was done by modifying the questionnaire used in 2004 and surveying several biologists about their recent information needs. In addition to describing their information needs, biologists were asked whether the desired answer was a list of a certain type of thing, such as genes, proteins, diseases, or mutations. We collected about 50 information-needs statements, of which about 36 will be used as official topics and 14 as sample topics. A list of entity types is provided below.

Similar to last year, systems will return passages of text. Relevance judges will assign "answers" to the relevant passages, i.e., items belonging to a single entity class, analogous to the assignment of MeSH aspects in 2006. After the top nominated passages are pooled as in past years, relevance judges will select the relevant passages and then assign one or more answer identifiers to each relevant passage.

To be marked relevant, a passage must contain one or more named entities of the given type, along with supporting text that answers the given question. Passages will be given credit for each relevant and supported answer. This requirement exists because the expert judges will likely be unable to judge a response unless it contains an entity of the type they are looking for.

We will use the same full-text document corpus assembled for last year's Genomics Track. The documents in this corpus come from the Highwire Press electronic distribution of journals and are in HTML format. There are about 160,000 documents in the corpus, drawn from about 49 genomics-related journals.

The evaluation measures for this task are a refinement of the measures used in 2006. We will continue to use document MAP as is, i.e., a document that contains a passage judged relevant is deemed relevant. As last year, we will use a character-based MAP measure to compare the accuracy of the extracted answers, but we will modify it to treat each individually retrieved character, in published order, as relevant or not, in a sort of "every character is a mini relevance-judged document" approach. This will make the passage MAP measure more stable against arbitrary passage-splitting techniques. The aspect measure will remain the same, except that instead of using assigned MeSH aspects we will simply use the answer entities assigned by the relevance judges. Entity-based MAP should do better than the prior year's aspect measure at rewarding a wider range of correct answers ranked highly, because for questions with multiple answers the entity-based aspects should overlap less between passages than last year's MeSH categories did.

Documents

The documents for this task come from the new full-text biomedical corpus we have assembled. We have obtained permission from a number of publishers who use Highwire Press for electronic distribution of their journals. They have agreed to allow us to include their full text in HTML format, which preserves formatting, structure, table and figure legends, etc. For more information on the document collection, see the 2006 track data page. As noted there, there are some issues with the document collection:
As noted in the 2006 protocol, there are some discrepancies between the PMIDs designated by Highwire and the actual PMIDs assigned by NLM in MEDLINE. We have identified 1,767 instances (about 1% of the 162K documents) where the Highwire file PMID is invalid, in the sense that searching for it in PubMed returns zero hits. Some invalid PMIDs arise because the corresponding documents are errata or author responses to comments (e.g., author replies to letters); these were assigned PMIDs in publisher-supplied data, but NLM generally does not cite them separately in PubMed and therefore deleted the PMIDs, which nonetheless remained in the publisher data. Other documents submitted by Highwire with a PMID were ones that NLM, by policy, decided not to index at all; again, NLM deleted the PMID, but it was retained in the Highwire data. We have also found instances of invalid PMIDs in Highwire data for documents that are cited in PubMed but under a different PMID absent from the Highwire data; such instances can be characterized as errors. In any case, we have investigated the problem of invalid PMIDs and found that, in every instance we checked, the original Highwire file contained the invalid PMID. In other words, the invalid PMIDs are in the Highwire data and are not a result of our processing. For this reason, we have decided not to delete these files from the collection. They represent, in our view, normal dirty data, whether due to errors or to policy differences between NLM and the publishers.

Since the goal of the task is passage retrieval, we have developed some additional data sources that will aid us in managing and evaluating runs. As noted below, retrieved passages can contain any span of text that does not include an HTML paragraph tag (i.e., one starting with <P or </P). We will also use these delimiters to extract text that will be assessed by the relevance judges. Because there has been much confusion in the discussion about the different types of passages, we define the following terms:
We note some other things about the maximum-length legal spans:
To facilitate our management of the data, and perhaps to be of use to participants, we have created a file, legalspans.txt, which identifies all of the maximum-length legal spans in all of the documents, i.e., all spans of more than 0 bytes delimited by HTML paragraph tags. Each line of the file gives the PMID, the byte offset of the span in the HTML file, and its length; the index of the first byte of a file is 0. Note that the first span of a document includes all of the HTML prior to the first <p>, which will obviously not be part of any relevant passage. This 215-megabyte file is in the protected area of the Web site.
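As a rough illustration of how this file can be consumed, here is a minimal Python sketch (our own helper, not part of any distributed tooling), assuming each line holds the three whitespace-separated fields just described:

from collections import defaultdict

def load_legal_spans(path="legalspans.txt"):
    """Read legalspans.txt into a dict: PMID -> list of (byte offset, length).
    Each line is expected to contain a PMID, the 0-based byte offset of a
    maximum-length legal span in that document's HTML file, and its length."""
    spans = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or unexpected lines
            pmid, offset, length = fields
            spans[pmid].append((int(offset), int(length)))
    return spans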

Let us illustrate these span definitions with an example. The last line of the following data is sample text from an HTML file hypothetically named 12345.html (i.e., having PMID 12345). The numbers above it give the tens (top line) and ones (middle line) digits of the byte position in the file.
000000000011111111112222222222333333333344444444445
012345678901234567890123456789012345678901234567890
Aaa. <p> Bbbbb <b>cc</b> ddd. <p><p><p> Eee ff ggg.
The maximum-length legal spans in this example are from bytes 0-4, 8-29, and 39-50. Our legalspans.txt file would include the following data:
12345 0  5
12345 8 22
12345 39 12
Let's consider the span 8-29 further. This is a maximum-length legal span because there is an HTML paragraph tag on either side of it. If a system nominates a passage that exceeds these boundaries, it will be disqualified from further analysis and judgment. But anything within the maximum-length legal span, e.g., 8-19, 18-19, or 18-28, could be a nominated or relevant passage.

We note that it is possible for there to be more than one relevant passage in a maximum-length legal span. While this will be unlikely, our character-based scoring approach (see below) will handle it fine.
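To make the boundary rule concrete, here is a small sketch (again our own helper, not part of any distributed tooling) that checks whether a nominated passage stays inside one of a document's maximum-length legal spans, using the toy 12345.html example above:

def within_legal_span(start, length, legal_spans):
    """True if the passage [start, start+length) lies entirely inside one of
    the document's maximum-length legal spans, given as (offset, length) pairs."""
    end = start + length
    return any(s <= start and end <= s + l for (s, l) in legal_spans)

# Maximum-length legal spans for the hypothetical 12345.html above.
spans_12345 = [(0, 5), (8, 22), (39, 12)]
print(within_legal_span(18, 11, spans_12345))  # True: bytes 18-28 fall inside 8-29
print(within_legal_span(25, 20, spans_12345))  # False: bytes 25-44 cross a <p> tag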

Topics

There are about 36 topics for the track this year, which were released on June 16th. As noted above, the topics are in the form of questions asking for lists of specific entities. These entities are based on controlled terminologies from different sources, with the source of the terms depending on the entity type. Below is a table of the entity types.

Entity type: definition [source of definition]
ANTIBODIES: Immunoglobulin molecules having a specific amino acid sequence by virtue of which they interact only with the antigen (or a very similar shape) that induced their synthesis in cells of the lymphoid series (especially plasma cells). [MeSH]
BIOLOGICAL SUBSTANCES: Chemical compounds that are produced by a living organism. [Phoebe]
CELL OR TISSUE TYPES: A distinct morphological or functional form of cell, or the name of a collection of interconnected cells that perform a similar function within an organism. [Wikipedia]
DISEASES: A definite pathologic process with a characteristic set of signs and symptoms. It may affect the whole body or any of its parts, and its etiology, pathology, and prognosis may be known or unknown. [MeSH]
DRUGS: A pharmaceutical preparation intended for human or veterinary use. [MeSH]
GENES: Specific sequences of nucleotides along a molecule of DNA (or, in the case of some viruses, RNA) which represent functional units of heredity. [MeSH]
MOLECULAR FUNCTIONS: Elemental activities, such as catalysis or binding, describing the actions of a gene product or bioactive substance at the molecular level. [GO + Aaron]
MUTATIONS: Any detectable and heritable change in the genetic material that causes a change in the genotype, and which is transmitted to daughter cells and to succeeding generations. [MeSH]
PATHWAYS: A series of biochemical reactions occurring within a cell to modify a chemical substance or transduce an extracellular signal. [Wikipedia (metabolic pathway, widened to include signal transduction pathways)]
PROTEINS: Linear polypeptides that are synthesized on ribosomes and may be further modified, crosslinked, cleaved, or assembled into complex proteins with several subunits. [MeSH]
STRAINS: A genetic subtype or variant of a virus or bacterium. [Wikipedia]
SIGNS OR SYMPTOMS: A sensation or subjective change in health function experienced by a patient, or an objective indication of some medical fact or quality that is detected by a physician during a physical examination of a patient. [Wikipedia]
TOXICITIES: A measure of the degree and manner in which something is toxic or poisonous to a living organism. [Wikipedia + Aaron]
TUMOR TYPES: An abnormal growth of tissue, originating from a specific tissue of origin or cell type, and having defined characteristic properties, such as a recognized histology. [Aaron]

A group of 14 sample topics has been developed, one for each of the entity types. As can be seen below, the list entity types are incorporated into the questions as capitalized phrases within square brackets. We have created a training topics file, in an Excel spreadsheet, that contains each sample topic along with a sample passage from a specific document and one or more entities. Here is the list of sample topics:
<T1>What [ANTIBODIES] have been used to detect protein TLR4?
<T2>What [BIOLOGICAL SUBSTANCES] have been used to measure toxicity in response to cytarabine?
<T3>What [CELL OR TISSUE TYPES] express members of the mammalian TIM gene family?
<T4>What [DISEASES] are associated with lysosomal abnormalities in the nervous system?
<T5>What [DRUGS] have been tested in mouse models of Alzheimer's disease?
<T6>What centrosomal [GENES] are implicated in diseases of brain development?
<T7>What [MOLECULAR FUNCTIONS] does helicase protein NS3 play in HCV (Hepatitis C virus)?
<T8>What [MUTATIONS] in apolipoprotein genes are associated with disease?
<T9>Which [PATHWAYS] are possibly involved in the disease ADPKD?
<T10>What [PROTEINS] does epsin1 interact with during endocytosis?
<T11>What Streptococcus pneumoniae [STRAINS] are resistant to penicillin and erythromycin?
<T12>What [SIGNS OR SYMPTOMS] of anxiety disorder are related to lipid levels?
<T13>What [TOXICITIES] are associated with cytarabine?
<T14>What [TUMOR TYPES] are associated with Rb1 mutations?
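Since the entity type appears as a bracketed, capitalized phrase inside each question, a simple regular expression can pull it out. The sketch below is our own illustration and assumes topics arrive one per line in the <Tn>question form shown above; the distribution format of the official topics file may differ.

import re

# Matches a line like "<T3>What [CELL OR TISSUE TYPES] express ...?"
TOPIC_RE = re.compile(r"<T(\d+)>(.*\[([A-Z ]+)\].*)")

def parse_topic(line):
    """Return (topic number, question text, entity type) for one sample-topic line."""
    m = TOPIC_RE.match(line.strip())
    if not m:
        return None
    number, question, entity_type = m.groups()
    return int(number), question, entity_type

print(parse_topic("<T5>What [DRUGS] have been tested in mouse models of Alzheimer's disease?"))
# (5, "What [DRUGS] have been tested in mouse models of Alzheimer's disease?", 'DRUGS')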

An example of our topic development process is as follows. Suppose that the information need was:
What is the genetic component of alcoholism?
We would transform this into a list question of the form:
What genes are genetically linked to alcoholism? -> [GENE]+
Answers to this question will be passages that relate one or more entities of type GENE to alcoholism. For example, this would be a valid and relevant answer to the above question: "The DRD4 VNTR polymorphism moderates craving after alcohol consumption" (from PMID 11950104, for those who want to know). The GENE entity supported by this statement would be DRD4.

The official topics file is in the protected area of the track Web site. This file was replaced on June 23rd to correct some minor errors. Make sure you use the revised file, especially since we have changed the topic numbers so they do not overlap with topic numbers used in previous years of the track.

Submissions

Runs should be submitted to the protected area of the NIST Web site by July 15. The URL is http://ir.nist.gov/trecsubmit/g.html.

Submitted runs can contain up to 1000 passages per topic that are predicted to be relevant to answering the topic question. Passages must be identified by the PMID, the start offset into the text file in characters, and the length of the passage in characters. Since we are (or pretend to be) computer scientists, the first byte of each file will be offset 0.

Passages must be contiguous and no longer than one paragraph. This will be operationalized by prohibiting any passage from containing an HTML paragraph tag, i.e., one starting with <P or </P. Any passage containing these tags will be ignored in the judgment process but not omitted from the scoring process (in other words, it will not count as relevant but will count as retrieved). Each participating group will be allowed to submit up to three official runs, one of which must be designated as the precedence run to be used for building pools. (Of course, once the relevance judgments are released, groups will be able to score any additional runs they do.) Each passage will also need to be assigned a corresponding rank number and rank value, which will be used to order nominated passages for rank-based performance computations. Rank values can be integers or floating-point numbers such as confidence values.

Each run must be submitted in a separate file, with each line defining one nominated passage using the following format, based loosely on trec_eval. Each line in the file must contain the following data elements, separated by white space (please use only spaces or a tab character): the topic ID, the PMID of the document, the rank of the passage, the rank value, the passage start offset in characters, the passage length in characters, and the run tag.
Here is an example of what the file might look like:
200 12474524 1 1.0   1572 27  tag1
200 12513833 2 0.373 1698 54 tag1
200 12517948 3 0.222 99 159 tag1
201 12531694 1 0.907 232 38 tag1
201 12545156 2 0.456 789 201 tag1
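For those who want a quick local sanity check before the official script is available, here is a minimal Python sketch. It is our own illustration, not the check_genomics.pl script, and it assumes the seven whitespace-separated fields shown above; the run file name in the usage example is hypothetical.

def check_run_line(line, counts, max_per_topic=1000):
    """Return an error string for a malformed run line, or None if it looks OK.
    counts is a dict tracking how many passages have been seen per topic."""
    fields = line.split()
    if len(fields) != 7:
        return "expected 7 fields, got %d" % len(fields)
    topic, pmid, rank, value, start, length, tag = fields
    try:
        rank, start, length = int(rank), int(start), int(length)
        float(value)
    except ValueError:
        return "rank, rank value, start offset, and length must be numeric"
    if start < 0 or length <= 0:
        return "start offset must be >= 0 and length must be > 0"
    counts[topic] = counts.get(topic, 0) + 1
    if counts[topic] > max_per_topic:
        return "more than %d passages submitted for topic %s" % (max_per_topic, topic)
    return None

counts = {}
for line in open("myrun.txt"):  # "myrun.txt" is a hypothetical file name
    if not line.strip():
        continue  # ignore blank lines
    problem = check_run_line(line, counts)
    if problem:
        print(problem)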
Runs will be submitted using a form in the active participants area of the NIST TREC Web site. A Perl script that checks your run to ensure it is in the proper format will be distributed soon.

Your run should include a "dummy" passage for any topic for which you did not retrieve any passages. That dummy passage should use "0" as the docid, "0" as the passage start, and "1" as the passage length. This will work with check_genomics.pl and does not correspond to a document in the collection.
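For illustration only, a dummy line in the run format above might look like the following, where the topic number and run tag are hypothetical and the rank and rank value are arbitrary placeholders:
210 0 1 0.0 0 1 tag1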

Runs will also need to be classified based on the amount of human intervention used in converting topics to queries. We will adopt the "usual" TREC rules (detailed at http://trec.nist.gov/act_part/guidelines/trec8_guides.html) for categorizing runs as automatic, manual, or interactive.

Relevance Judgments

The expert judging for this evaluation will use the pooling method, with passages corresponding to the same topic ID pooled together. The judges will be presented with the text of the maximum-length legal span containing each pooled passage. They will then evaluate the text of the maximum-length legal span for relevance and identify the portion of this text that contains an answer. This could be all of the text of the maximum-length legal span, or any contiguous substring of it. It is possible that one maximum-length legal span could yield two separate gold standard passages, but this will likely be uncommon; our evaluation methodology can handle it if the judges deem it necessary. Full instructions for the judges are available.

Assessing System Performance

For this year’s track, there will again be three levels of retrieval performance measured: passage retrieval, aspect retrieval, and document retrieval. Each of these provides insight into the overall performance of a system helping a user answer the given topic questions. Each will be measured by some variant of mean average precision (MAP). We will again measure the three types of performance separately; there will not be any summary metric to grade overall performance. A Python program to calculate these measures from the appropriate gold standard data files is available.

Passage-level retrieval performance - character-based MAP

The original passage retrieval measure for the 2006 track was found to be problematic in that non-content manipulations of passages had substantial effects on passage MAP, with one group claiming that breaking passages in half, with no other changes, doubled their (otherwise low) score. To address this, we defined an alternative passage MAP (PASSAGE2) that calculates MAP as if each character in each passage were a ranked document. In essence, the output of passages is concatenated in rank order, with each character coming from a relevant passage or not. We will use PASSAGE2 as the primary passage retrieval evaluation measure in 2007.
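The following is a minimal sketch of the character-level idea behind PASSAGE2 as we understand it from the description above. It is not the official scoring program, and details such as the handling of characters retrieved more than once are glossed over.

def character_average_precision(ranked_passages, gold_offsets):
    """ranked_passages: list of (pmid, start, length) in system rank order.
    gold_offsets: dict mapping pmid -> set of character offsets judged relevant.
    Every retrieved character is treated as a tiny ranked 'document'."""
    total_relevant = sum(len(offs) for offs in gold_offsets.values())
    if total_relevant == 0:
        return 0.0
    retrieved = 0       # characters emitted so far
    relevant_seen = 0   # relevant characters emitted so far
    precision_sum = 0.0
    for pmid, start, length in ranked_passages:
        relevant_here = gold_offsets.get(pmid, set())
        for offset in range(start, start + length):
            retrieved += 1
            if offset in relevant_here:
                relevant_seen += 1
                precision_sum += relevant_seen / retrieved
    # Relevant characters never retrieved contribute a precision of 0.
    return precision_sum / total_relevant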

The original measure will also be calculated.
This measure computes individual precision scores for passages based on character-level precision, using a variant of an approach used for the TREC 2004 HARD Track. For each nominated passage, some fraction of its characters will overlap with the characters deemed relevant by the judges in the gold standard. At each relevant retrieved passage, precision is computed as the number of characters overlapping with the gold standard passages divided by the total number of characters included in all nominated passages from the system for that topic up to that point. As with regular MAP, relevant passages that were not retrieved are also included in the calculation, with precision set to 0. The mean of these average precisions over all topics then gives the mean average passage precision.

Aspect-level retrieval performance - aspect-based MAP

Aspect retrieval will be measured using the average precision for the aspects of a topic, averaged across all topics. To compute this, for each submitted run, each ranked passage will be transformed into one of two types of values: the answer aspect(s) assigned to that passage by the judges, if the passage is relevant, or not-relevant otherwise. This will result in an ordered list, for each run and each topic, of aspects and not-relevant entries. Because we are uncertain of the utility for a user of a repeated aspect (e.g., the same aspect occurring again further down the list), we will discard repeats from the output to be analyzed. For the remaining aspects of a topic, we will calculate MAP similarly to how it is calculated for documents. A sketch of one reading of this computation follows.
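Here is one plausible reading of that computation (our own interpretation, not the official program); it credits a rank position only when it contributes at least one aspect not yet seen, and normalizes by the total number of gold standard aspects for the topic. The official program may differ, for example in how it credits positions that introduce several new aspects at once.

def aspect_average_precision(ranked_passages, passage_aspects, gold_aspects):
    """ranked_passages: passage identifiers in system rank order.
    passage_aspects: dict mapping a passage identifier to the set of aspects
    (answer entities) the judges assigned it; non-relevant passages map to set().
    gold_aspects: the full set of aspects judged relevant for the topic."""
    if not gold_aspects:
        return 0.0
    seen = set()
    new_aspect_hits = 0
    precision_sum = 0.0
    for rank, pid in enumerate(ranked_passages, start=1):
        new = passage_aspects.get(pid, set()) - seen
        if new:
            seen.update(new)
            new_aspect_hits += 1
            precision_sum += new_aspect_hits / rank
    return precision_sum / len(gold_aspects)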

Document-level retrieval performance - document-based MAP

For the purposes of this measure, any PMID that has a passage associated with a topic ID in the set of gold standard passages will be considered a positive (relevant) document for that topic. All other documents are considered negative (not relevant) for that topic. System run outputs will be similarly collapsed to the document level, with each document ranked at the position where its PMID first appears in the nominated passages for that topic.

For a given system run, average precision will be measured at each point of correct (relevant) recall for a topic. The MAP will be the mean of the average precisions across topics.
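A minimal sketch of this collapsing and scoring step, under the same caveats as the sketches above (our own illustration, not the official program):

def document_average_precision(ranked_passages, relevant_pmids):
    """ranked_passages: list of (pmid, start, length) in system rank order.
    relevant_pmids: set of PMIDs with at least one gold standard passage."""
    if not relevant_pmids:
        return 0.0
    seen = set()
    ranked_docs = []
    for pmid, _start, _length in ranked_passages:
        if pmid not in seen:          # keep the first occurrence of each PMID
            seen.add(pmid)
            ranked_docs.append(pmid)
    hits = 0
    precision_sum = 0.0
    for rank, pmid in enumerate(ranked_docs, start=1):
        if pmid in relevant_pmids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_pmids)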

Other Rules

As with many TREC tasks, groups will be able to manually modify topics to create the queries for their systems. In addition, they will be able to consult outside resources on the Web (e.g., gene databases), but only in a fully automated fashion. In other words, the original queries may be manually modified, but interaction with external resources can only be done automatically. For example, if your system pulls information from SOURCE, GenBank, or any other resource, the queries to those sources and the retrieval of information from them must be done in an automated way, i.e., without manual intervention.

Those who do modify queries manually must describe their runs as manual or interactive, depending on whether they inspect system output (in which case the run should be categorized as interactive) or not (in which case the run should be categorized as manual).