TREC 2005 Genomics Track Protocol
William Hersh, Track Chair
Last updated - November 28, 2005
This page contains the protocol for the TREC 2005 Genomics Track.
As
with the 2003 and 2004 tracks, there were two tasks.
Similar to
2004, one of the tasks consisted of
ad hoc retrieval, while the
second involved text categorization. There is a companion data page for the 2005 track that includes
instructions for accessing the data.
Official Runs
For the ad hoc task, 58 official runs were submitted by 32
participating research groups. For the categorization tasks, 192
runs were submitted by 19 groups. The results are reported in the
track overview paper.
Ad Hoc Retrieval Task
In the 2005 ad hoc retrieval task, we employed topics that were
more
structured than the mostly free-form topics from the 2004 track.
The purpose of this approach was to provide systems with better defined
(yet still realistic) queries for finding genomics information.
As such, we developed topics from generic structured templates so
systems could make better use of other resources, such as ontologies or
databases. We also hoped this could serve as the basis to begin
investigation toward an interactive task.
Topics
As with 2004, we collected topics from real biologists.
However, instead of soliciting free-form topics, we provided the
biologists we interviewed with generic templates and asked them to
express
information needs they had recently that fit within those
templates.
While it would have been ideal to interview users and develop the
templates
themselves from such interviews, the time frame of the track did not
allow this. Instead, we developed a set of generic topic
templates (GTTs) derived from an analysis of the topics from the 2004
track and other known biologist information needs. After we
developed the GTTs, 11 people went out and interviewed 25 biologists to
obtain instances of
them. We then had other people do some searching on the topics to
make sure there was at least some, although not too much, information
out
there about them. The topics
did not have to fit precisely into the GTTs, but had to come
close (i.e., have all the required semantic types).
As with 2004, there were 50 topics. We reached closure on 5 GTTs,
each of which had 10
instances, for a total of 50 topics. The five GTTs are listed
below. (We had an
extra one in case one did not pan out during the interviews, which
turned out to be the case.) The semantic types in each GTT are
underlined. For some semantic types, more than one
instance is allowed. The five GTTs are:
- Find articles describing standard methods or protocols for
doing some sort of experiment or procedure.
- Find articles describing the role of a gene
involved in a given disease.
- Find articles describing the role of a gene in a specific biological process.
- Find articles describing interactions (e.g.,
promote, suppress, inhibit, etc.) between two or more genes in the function of an organ
or in a disease.
- Find articles describing one or more mutations of a given gene
and its biological impact.
In order to get participating groups started with the topics, and in
order for them not to "spoil" the automatic status of their official
runs by working with the official topics, we developed 10 sample
topics, with two coming from each GTT. Both the sample and
official topics were available in three formats: a Word file
containing the GTTs and their instances (the topics) in tabular form, a
PDF file that is a "printout" of the Word file, and a text file that
has the topics expressed in a narrative form (essentially the GTTs
filled in with the instances). The sample topics are available on the
track Web site.
All of the sample topics had sample searches and associated relevance
judgments for the retrieved articles. The files with these
searches and judgments were posted in the active user portion of the
track Web site. The files were named 9X.txt, where 9X is the
topic number. In each file is the PubMed
search statement that generated the output, which was filtered for the
time period of the MEDLINE subset (1994-2003).
In the
even-numbered files (90, 92, 94, 96, 98), there is a tag DR, PR, or NR
(definitely, possibly, or not relevant, respectively) before each
MEDLINE record that represents the relevance
judgment. The tag is right before the number of the document in
the PubMed output, e.g., DR1 represents that the first record in the
PubMed output for this particular search was definitely relevant.
In the odd-numbered files (91, 93, 95, 97, 99), the tag DR, PR, or NR
that represents the relevance
judgment is after
each MEDLINE record.
Please note that neither the searches nor the judgments
should be considered complete. In addition, some of the retrieved
MEDLINE records may not be in the MEDLINE subset, while some records
that are in the test collection and should have been retrieved may not
appear in the search output. These searches and judgments are simply provided "as is"
for those who want to see some for these topics.
The official topics are available on the active user portion of the track
Web site. A password, which can ONLY be obtained from Lori Buckland of
NIST (please do not email me to ask for it!), is required to access the
data. The names of the files
are:
- adhoc2005narrative.txt - Narrative version of official topics
(100-149)
- adhoc2005topics.doc - Tabular (Microsoft Word) version of
official topics (100-149)
- adhoc2005topics.pdf - Tabular (PDF) version of official topics
(100-149)
Documents
The document collection for the 2005 ad hoc retrieval task was the
same 10-year MEDLINE subset used for the 2004 track. One goal we
had was
to produce a number of topic and relevance judgment collections that
use this same document collection to make retrieval experimentation
easier (so people do not have to load different collections into their
systems). More uses of this subset will be forthcoming later.
More detail about the document collection is available on the 2004
protocol page and 2004 data page,
although highly pertinent information is reproduced
here. The document collection for the task is a 10-year subset of
the MEDLINE
bibliographic database of the biomedical literature. MEDLINE can
be searched by anyone in the world using the PubMed system of the National Library of Medicine (NLM),
which maintains both MEDLINE and PubMed. The full MEDLINE
database contains over 13 million references dating back to 1966 and is
updated on a daily basis.
The subset of MEDLINE for the TREC 2005 Genomics Track
consists of 10 years of completed citations from the database
inclusive
from 1994 to 2003. Records were extracted using the Date
Completed (DCOM) field for all references in the range of 19940101 -
20031231. This provided a total of 4,591,008
records, which is about one third of the full MEDLINE database. A
gzipped list of the PubMed IDs (PMIDs) is available (10.6 megabytes
compressed).
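For groups that want to check whether a given citation falls inside the
subset, the PMID list can be loaded into a set. This is only an informal
sketch; the file name is a placeholder, and one PMID per line is assumed:

# Sketch: load the gzipped PMID list into a set for subset-membership checks.
# The file name is a placeholder; one PMID per line is assumed.
import gzip

def load_pmid_set(path="medline_subset_pmids.txt.gz"):
    with gzip.open(path, "rt") as fh:
        return {line.strip() for line in fh if line.strip()}

# pmids = load_pmid_set()
# "12345678" in pmids  ->  True if that record is in the 10-year subset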
The data includes all of the PubMed fields identified in the MEDLINE
Baseline record and only the PubMed-centric tags are removed from the
XML version. A description of the various fields of MEDLINE is
available at:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#MEDLINEDisplayFormat
It should also be noted that not all MEDLINE records have
abstracts, usually because the article itself does not have an
abstract. In general, about 75% of MEDLINE records have
abstracts. In our subset, there are 1,209,243 (26.3%) records
without abstracts.
The MEDLINE subset is available in the "MEDLINE" format, which consists
of ASCII text with fields indicated and delimited by 2-4
character abbreviations. The size of the file uncompressed is
9,587,370,116 bytes, while the gzipped version is
2,797,589,659 bytes. The file can be found with the 2004 files on
the data portion of the Web site. An XML version of MEDLINE
subset is also available; see the 2004 data
page for details.
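For groups writing their own loaders for the ASCII version, the format is
line oriented: each field starts with its 2-4 character abbreviation
followed by a dash, and continuation lines are indented. The following is
a rough sketch rather than an official tool; the assumptions noted in the
comments (blank-line record separators, 4-character tag column) should be
verified against the actual file:

# Rough sketch of a reader for the ASCII "MEDLINE" format described above.
# Assumptions to verify against the actual file: records are separated by
# blank lines, the field abbreviation occupies the first 4 characters
# followed by "- ", and continuation lines begin with whitespace.
def parse_medline(path):
    record, tag = {}, None
    for line in open(path, encoding="ascii", errors="replace"):
        line = line.rstrip("\n")
        if not line.strip():                 # blank line ends the current record
            if record:
                yield record
            record, tag = {}, None
        elif line[:1].isspace() and tag:     # continuation of the previous field
            record[tag][-1] += " " + line.strip()
        else:
            tag = line[:4].strip()           # e.g. PMID, TI, AB, DCOM
            record.setdefault(tag, []).append(line[6:].strip())
    if record:                               # final record if no trailing blank line
        yield record

# Example: count records without an abstract (the AB field).
# missing = sum(1 for rec in parse_medline("medline_subset.txt") if "AB" not in rec)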
Groups contemplating using their own MEDLINE collection or filtering
from "live" PubMed should be aware of the following caveats (from Jim
Mork of NLM):
- Some citations may differ due to revisions or corrections.
- Some citations may no longer exist. The bulk of the
citations come from the 2004 Baseline; some may have been removed since
that baseline was created.
- UTF-8 characters have been translated to 7-bit ASCII whereas the
live PubMed system provides UTF-8 data and in some cases puts in
characters like the inverted question mark.
- The XML and ASCII files have been modified to conform to the
MEDLINE Baseline format. This is a subtle difference but one that
may cause changes in the files.
Relevance Judgments
Relevance judgments were done similarly to the TREC 2004 Genomics
Track and other TREC tracks using the "pooling" method, where the
top-ranking documents from each group's best run for each topic were
pooled and given to a judge with expertise in biology. The relevance judges were
instructed in the following manner for each GTT:
- Relevant article must describe how to conduct, adjust, or improve
a standard, a new method, or a protocol for doing some sort of
experiment or procedure.
- Relevant article must describe some specific role of the gene
in the stated disease.
- Relevant article must describe some specific role of the gene
in the stated biological process.
- Relevant article must describe a specific interaction (e.g.,
promote, suppress, inhibit, etc.) between two or more genes in the
stated function of the organ
or the disease.
- Relevant article must describe a mutation of the stated gene
and the particular biological impact(s) that the mutation has been
found to have.
In general, the articles had to describe a specific gene, disease,
impact, mutation, etc. and not the concept generally.
Submitted Runs
We
collected other data about submitted runs besides the system
output. One item was the run
type, which fell into one of (at least) three
categories:
- Automatic - no manual intervention in building queries
- Manual - manual construction of queries but no further human
interaction
- Interactive - completely interactive construction of queries and
further interaction with system output
Recall and precision for the ad hoc retrieval task were calculated
in
the classic IR way, using the preferred TREC statistic of mean average
precision (average precision at each point a relevant document is
retrieved, also called MAP). This was done using the
trec_eval program. The code for
trec_eval is available at http://trec.nist.gov/trec_eval/trec_eval.7.3.tar.gz.
The trec_eval program requires two files for input. One file is
the topic-document output, sorted by each topic and then subsorted by
the order of the IR system output for a given topic. This
format is required for official runs submitted to NIST to obtain
official scoring.
The topic-document output should be formatted as follows:
100 Q0 12474524 1 5567 tag1
100 Q0 12513833 2 5543 tag1
100 Q0 12517948 3 5000 tag1
101 Q0 12531694 4 2743 tag1
101 Q0 12545156 5 1456 tag1
102 Q0 12101238 1 3.0 tag1
102 Q0 12527917 2 2.7 tag1
103 Q0 11731410 1 .004 tag1
103 Q0 11861293 2 .0003 tag1
103 Q0 11861295 3 .0000001 tag1
where:
- The first column is the topic number (100-149) for the 2005
topics.
- The second column is the query number within that topic.
This is currently unused and must always be Q0.
- The third column is the official PubMedID of the retrieved
document.
- The fourth column is the rank at which the document was retrieved.
- The fifth column shows the score (integer or floating point) that
generated the ranking. This score MUST be in descending
(non-increasing) order. The trec_eval program ranks documents
based on the scores, not the ranks in column 4. If a submitter
wants the exact ranking submitted to be evaluated, then the SCORES must
reflect that ranking.
- The sixth column is called the "run tag" and must be a unique
identifier across all runs submitted to TREC. Thus, each run tag
should have a part that identifies the group and a part that
distinguishes runs from that group. Tags are restricted to 12 or
fewer letters and numbers, and *NO* punctuation, to facilitate labeling
graphs and such with the tags.
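As an informal illustration of the format (not an official tool), a run
file could be written from per-topic ranked results as follows; the
results structure, file name, and run tag here are hypothetical examples:

# Sketch: write ranked retrieval output in the six-column run format above.
# "results" maps a topic number (100-149) to a list of (pmid, score) pairs;
# the structure, file name, and tag are hypothetical examples.
def write_run(results, run_tag, path):
    assert len(run_tag) <= 12 and run_tag.isalnum(), "tag: at most 12 letters/digits"
    with open(path, "w") as out:
        for topic in sorted(results):
            ranked = sorted(results[topic], key=lambda pair: pair[1], reverse=True)
            for rank, (pmid, score) in enumerate(ranked, start=1):
                # topic  Q0  PMID  rank  score  run_tag
                out.write(f"{topic} Q0 {pmid} {rank} {score} {run_tag}\n")

# write_run({100: [(12474524, 5567), (12513833, 5543)]}, "tag1", "myrun.txt")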
The second file required for trec_eval is the relevance judgments,
which are called "qrels" in TREC jargon. More information about
qrels can be found at http://trec.nist.gov/data/qrels_eng/. The qrels
file is in the following format:
100 0 12474524 1
101 0 12513833 1
101 0 12517948 1
101 0 12531694 1
101 0 12545156 1
102 0 12101238 1
102 0 12527917 1
103 0 11731410 1
103 0 11861293 1
103 0 11861295 1
103 0 12080468 1
103 0 12091359 1
103 0 12127395 1
103 0 12203785 1
where:
- The first column is the topic number (100-149) for the 2005
topics.
- The second column is always 0.
- The third column is the PubMedID of the document.
- The fourth column is always 1.
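For a rough sanity check of a run before submission, mean average
precision can be approximated directly from these two file formats. This
is only a sketch of the measure as described above (ties are broken
arbitrarily here, unlike in trec_eval); official scores come from trec_eval:

# Sketch: approximate MAP from a run file and a qrels file in the formats above.
# Official scoring is done with trec_eval; this is only a sanity check.
from collections import defaultdict

def mean_average_precision(run_path, qrels_path):
    relevant = defaultdict(set)                    # topic -> set of relevant PMIDs
    for line in open(qrels_path):
        topic, _, pmid, rel = line.split()
        if rel != "0":                             # column 4 is 1 for relevant documents
            relevant[topic].add(pmid)

    runs = defaultdict(list)                       # topic -> [(score, pmid)]
    for line in open(run_path):
        topic, _, pmid, _, score, _ = line.split()
        runs[topic].append((float(score), pmid))

    ap_values = []
    for topic, rel_docs in relevant.items():
        ranked = sorted(runs.get(topic, []), reverse=True)   # rank by descending score
        hits, precision_sum = 0, 0.0
        for i, (_, pmid) in enumerate(ranked, start=1):
            if pmid in rel_docs:
                hits += 1
                precision_sum += hits / i          # precision at each relevant document
        ap_values.append(precision_sum / len(rel_docs))
    return sum(ap_values) / len(ap_values) if ap_values else 0.0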
Categorization Task
The second task for the 2005 track was a categorization task. It
was similar in part to the 2004 categorization task in that it used
data
from the
Mouse Genome Informatics (MGI) system and was a document triage
task. It included another
running of one subtask from 2004, the triage of articles for GO
annotation, and added triage of articles for three other
major types of information collected and
catalogued by MGI.
These included articles about:
- Tumor biology
- Embryologic gene expression
- Alleles of mutant phenotypes
As such, the categorization task looked at how well systems could
categorize documents for four categories (the three listed above plus GO
annotation). We used the same utility
measure used last year but with different parameters (see below).
We created an updated version of the cat_eval program that
calculates
the utility measure plus recall, precision, and the F score. We
calculated utility for each of the four categorization tasks separately.
For more information about the MGI system and the components from which
documents were triaged, consult the following references (not
all of these are freely available on the Web):
- Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, and the
members of the Mouse Genome Database Group. The Mouse Genome Database
(MGD): from genes to mice--a community resource for mouse biology.
Nucleic Acids Res 2005; 33: D471-D475.
- Strivens M, Eppig JT. Visualizing the laboratory mouse: capturing
phenotype information. Genetica 2004; 122: 89-97.
- Hill DP, Begley DA, Finger JH, Hayamizu TF, McCright IJ, Smith CM,
Beal JS, Corbani LE, Blake JA, Eppig JT, Kadin JA, Richardson JE,
Ringwald M. The Mouse Gene Expression Database (GXD): updates and
enhancements. Nucleic Acids Res 2004; 32: D568-D571.
- Näf D, Krupke DM, Sundberg JP, Eppig JT, Bult CJ. The Mouse Tumor
Biology database: a public resource for cancer genetics and pathology
of the mouse. Cancer Res 2002; 62(5): 1235-1240.
Documents
The documents for the 2005 categorization task consisted of the
same full-text articles used in 2004. The articles came from
three journals
over two years, reflecting the full-text data we were able to
obtain from Highwire Press: Journal
of Biological Chemistry (JBC), Journal
of Cell Biology (JCB), and Proceedings
of the National Academy of Sciences (PNAS). These
journals have a good proportion of mouse genome articles. Each
of the papers from these journals is in SGML format.
Highwire's DTD
and its documentation are available. As in 2004, we designated 2002
articles as training data and 2003 articles as test
data. The documents for the tasks came from a subset of these
articles that had the words "mouse" or "mice" or "mus" as described in
the 2004 protocol. A
crosswalk or look-up table was provided that matches an identifier for
each Highwire article (its file name) and its corresponding PubMed ID
(PMID). The table below shows the total number of articles and
the number in the subset the track used.
Journal      | 2002 papers (total, subset) | 2003 papers (total, subset) | Total papers (total, subset)
JBC          | 6566, 4199                  | 6593, 4282                  | 13159, 8481
JCB          | 530, 256                    | 715, 359                    | 1245, 615
PNAS         | 3041, 1382                  | 2888, 1402                  | 5929, 2784
Total papers | 10137, 5837                 | 10196, 6043                 | 20333, 11880
The following table lists the files containing the documents and
related data. The files can be found on the active user portion
of the Web site.
File contents                                                                                                        | Training data file name | Test data file name
Full-text document collection in Highwire SGML format (in 2004 data directory)                                      | train.tar.Z             | test.tar.Z
Crosswalk files of PMID, Highwire file name of article, journal code, and year of publication (in 2005 data directory) | train.crosswalk.txt     | test.crosswalk.txt
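A minimal sketch of loading a crosswalk file into a lookup table follows.
The field order is taken from the description above (PMID, Highwire file
name, journal code, year); the tab delimiter is an assumption and should
be checked against the actual file:

# Sketch: load a crosswalk file into a PMID -> (Highwire file, journal, year) map.
# Assumes one record per line with the four fields listed above, tab-separated;
# verify the delimiter and field order against the actual file.
def load_crosswalk(path):
    crosswalk = {}
    for line in open(path):
        if line.strip():
            pmid, highwire_file, journal, year = line.rstrip("\n").split("\t")
            crosswalk[pmid] = (highwire_file, journal, year)
    return crosswalk

# crosswalk = load_crosswalk("train.crosswalk.txt")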
The SGML training document collection is 150 megabytes in size
compressed and 449 megabytes uncompressed. The SGML test document
collection is 140 megabytes compressed and 397 megabytes
uncompressed. Many gene names have Greek or other
non-English characters, which can present a problem for those
attempting
to recognize gene names in the text. The Highwire SGML appears to
obey the rules posted on the NLM Web site with regards to these
characters (http://www.ncbi.nlm.nih.gov/entrez/query/static/entities.html).
Evaluation measures
We used the utility measure as the primary evaluation measure, but in a
slightly different way in 2005. This was because there are varying
numbers of positive examples for the four different categorization tasks.
The framework for evaluation in the categorization task was based on
the
following table of possibilities:
              | Relevant (classified) | Not relevant (not classified) | Total
Retrieved     | True positive (TP)    | False positive (FP)           | All retrieved (AR)
Not retrieved | False negative (FN)   | True negative (TN)            | All not retrieved (ANR)
Total         | All positive (AP)     | All negative (AN)             |
The measure for evaluation was the utility measure often applied
in
text categorization research and used by the former TREC Filtering
Track. This measure contains coefficients for the
utility of retrieving a relevant and retrieving a nonrelevant document.
We used a version that was normalized by the best possible
score:
Unorm = Uraw / Umax
For a test collection of documents to categorize, Uraw is
calculated as follows:
Uraw = (ur * TP) + (unr * FP)
where:
- ur = relative utility of relevant document
- unr = relative utility of nonrelevant document
For our purposes, we assume that unr = -1 and solve for ur
using MGI's current practice of triaging everything (i.e., if every
document is retrieved, then TP = AP and FP = AN, and we set the
resulting utility to zero):
0.0 = ur*AP - AN
ur = AN/AP
AP and AN were different for each task, as shown in the following table:
TASK              | TRAIN/TEST | AP  | AN   | N    | ur
A (allele)        | TRAIN      | 338 | 5499 | 5837 | 16.27
A (allele)        | TEST       | 332 | 5711 | 6043 | 17.20
E (expression)    | TRAIN      | 81  | 5756 | 5837 | 71.06
E (expression)    | TEST       | 105 | 5938 | 6043 | 56.55
G (GO annotation) | TRAIN      | 462 | 5375 | 5837 | 11.63
G (GO annotation) | TEST       | 518 | 5525 | 6043 | 10.67
T (tumor)         | TRAIN      | 36  | 5801 | 5837 | 161.14
T (tumor)         | TEST       | 20  | 6023 | 6043 | 301.15
(Yes, the numbers for GO annotation were different from the 2004
data. This is because additional articles were triaged by
MGI since we collected the data last year.)
The ur's for A and G are fairly close across the training
and test collections, while the ur's for E and especially T vary
much more. We therefore established a ur that was
the average of that computed for the training and test collection,
rounded to the nearest whole number. That resulted in this set of ur's
for each task:
TASK              | ur
A (allele)        | 17
E (expression)    | 64
G (GO annotation) | 11
T (tumor)         | 231
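To make the calculation concrete, here is a small sketch of the
normalized utility as defined above. It assumes Umax is the utility of a
perfect run (every positive document retrieved and no negatives, i.e.,
ur * AP), which is consistent with the sample cat_eval output shown
later; cat_eval is the official implementation:

# Sketch of the normalized utility defined above, with unr = -1.
# Assumes Umax = ur * AP, i.e. the score of a run retrieving every positive
# document and no negative ones; cat_eval is the official implementation.
def normalized_utility(tp, fp, ap, ur):
    u_raw = ur * tp - fp
    u_max = ur * ap
    return u_raw / u_max

# Expression sub-task, training data (AP = 81, ur = 64), run with tp=81, fp=2538:
# normalized_utility(81, 2538, 81, 64)  ->  0.5104 (matches the sample output below)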
In order to facilitate calculation of the modified version of the
utility measure for the 2005 track, we updated the cat_eval
program to version 2.0, which included a command-line parameter to set ur.
Here is documentation for the program:
The program has been tested on Windows and Solaris but should run on
just about any OS where you can install Python. Similar to
trec_eval,
there is very little error-checking for proper data format, so make
sure your data files are in the format specified on the track protocol
page. If you have any issues, please email Aaron Cohen at cohenaa@ohsu.edu.
Installation:
For cat_eval2.py, you need Python 2.3 or above installed. Install
as you would any script.
For cat_eval2.zip, unzip it to a directory on a Windows machine and
execute cat_eval2.exe; no Python installation is necessary.
Usage: cat_eval.py data_file gold_standard_file Urelevant [-tab]
The required Urelevant parameter needs to be set to the proper weight
for each of the sub-tasks.
The optional -tab argument makes the script output results in tab
delimited format instead of the console readable format.
There is one sample output file for each sub-task in the active user
portion of the track Web site:
- Allele task: sample.Atrain.txt
- Expression task: sample.Etrain.txt
- GO annotation task: sample.Gtrain.txt
- Tumor task: sample.Ttrain.txt
The cat_eval program expects a separate file for each run of each task,
where the file has three tab-separated fields:
triageG 12189157 sample
triageG 12189154 sample
triageG 12393878 sample
triageG 12451176 sample
triageG 12209011 sample
triageG 12209014 sample
where:
- The first column is the task, i.e., one of triageA, triageE,
triageG, or triageT.
- The second column is the PMID of a document classified as positive
for this task.
- The third column is the run tag, a short institution and run
identifier that distinguishes the runs of the group. Tags are
restricted to 12 or fewer letters and numbers, and no punctuation, to
facilitate labeling graphs and such with the tags.
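As an informal illustration (the list of positive PMIDs, tag, and file
name here are hypothetical), a run file in this three-column format
could be written like so:

# Sketch: write a categorization run in the three tab-separated columns above.
# "positive_pmids" is a hypothetical list of PMIDs classified as positive.
def write_triage_run(task, positive_pmids, run_tag, path):
    assert task in ("triageA", "triageE", "triageG", "triageT")
    assert len(run_tag) <= 12 and run_tag.isalnum()
    with open(path, "w") as out:
        for pmid in positive_pmids:
            out.write(f"{task}\t{pmid}\t{run_tag}\n")

# write_triage_run("triageG", [12189157, 12189154], "sample", "myrun.G.txt")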
Here is sample output you should get from the sample Expression
sub-task:
cat_eval.py sample.Etrain.txt triage.Etrain.txt 64.0
Run: sample
Counts: tp=81; fp=2538; fn=0
Precision: 0.0309
Recall: 1.0000
F-score: 0.0600
Utility Factor: 64.00
Raw Utility: 2646
Max Utility: 5184
Normalized Utility: 0.5104
The official results will calculate Unorm for each of the
four
categories as well as an overall mean of the four.
Participants were strongly
encouraged
to adhere to a naming convention for their run tags that has the first
letter of the tag designating the specific run: a for allele, e
for
expression, g for GO, and t for tumor.
Data
The training data came in four files, one for each category (i.e., A,
E, G, and T). (The fact that three of these four letters correspond to
nucleotides in DNA is purely coincidental!) They were
named as Atrain.txt, Etrain.txt, etc. and are available in the active
user portion of the track Web site. The test data obey the same
naming conventions.
What Resources Can Be Used?
A common question was, what resources can be legitimately used to aid
in
categorizing the documents? In general, we allowed use of
anything,
including resources on the MGI Web site. The only resource participants could not
use was the
direct data itself, i.e., data that is directly linked to the PMID or
the associated MGI unique identifier. Thus, they could not go
into
the MGI database (or any other aggregated resource such as Entrez Gene
or SOURCE) and pull out GO codes, tumor terms, mutant phenotypes, or
any other data that was explicitly linked to a document.
But anything else was fair game.
We also made available a cheatsheet
developed by MGI for its curators who triage documents. This
version of the sheet was a couple years old, but given that the
articles
and
data we are using were also that old, this sheet may actually have been
more
appropriate than an up-to-date one would have been. As with the
sample
searches in the ad hoc task, this was provided on an "as is" basis,
with
no guarantees it would be helpful. The appropriate "Areas" on the
sheet that correlate to the categories we were triaging to are:
- Alleles and phenotypes - A
- Expression - E
- Gene Ontology - G
- Tumor - T
Automated tagging of mouse genes in MEDLINE corpus
Aaron Cohen of OHSU processed the entire TREC 2004 10-year MEDLINE
subset with the mouse named-entity recognizer and normalizer (NER+N)
that we presented at BioLINK 2005. There are two files in the TREC 2004
genomics data directory that correspond to the two halves of the split
10 year subset:
- 2004_TREC_MEDLINE_1_MGI_FOR_PMID.tsv.txt corresponds to
2004_TREC_MEDLINE_1.gz
- 2004_TREC_MEDLINE_2_MGI_FOR_PMID.tsv.txt corresponds to
2004_TREC_MEDLINE_2.gz
(Note that these files are from the MEDLINE corpus for the ad hoc task,
but all of the full-text Highwire documents are in this corpus and can
be linked using the crosswalk files.)
The files are in tab-separated format, one line for each PMID. The
first field is the PMID of the MEDLINE record, followed by a variable
number of fields which are the MGI identifiers for the genes found by
our system in that record. Only MEDLINE records found to contain one or
more mouse genes are included. The files are about 18 megabytes each.
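A minimal sketch of loading these files into a PMID-to-genes lookup
(file names as listed above; tab-separated, first field the PMID,
remaining fields MGI identifiers):

# Sketch: load the tab-separated gene annotations into a dict mapping
# PMID -> list of MGI gene identifiers found in that MEDLINE record.
def load_gene_tags(path):
    genes_by_pmid = {}
    for line in open(path):
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            genes_by_pmid[fields[0]] = fields[1:]
    return genes_by_pmid

# tags = load_gene_tags("2004_TREC_MEDLINE_1_MGI_FOR_PMID.tsv.txt")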
This could be a useful addition to the corpus. However, the files were
automatically generated and have not been manually reviewed, so some
errors are to be expected. In Aaron's BioLink paper (available at http://acl.ldc.upenn.edu/W/W05/W05-1303.pdf),
he reported a precision of 0.775 at a recall of 0.726 for mouse gene
NER+N on the BioCreative test collection, which is competitive with the
state of the art.