TREC 2004 Genomics Track Final Protocol

William Hersh
Oregon Health & Science University, Track Chair
hersh@ohsu.edu

Last updated - February 23, 2005

This file contains the final protocol for the TREC 2004 Genomics Track. After the regular track cycle was completed, the data for the categorization task was updated to reflect additional changes to the Mouse Genomic Informatics (MGI) database that had been done while the track was taking place. This was done to provide the most up-to-date data to those continuing to perform experiments with track data. An update file describes the changes to the data.

A total of 33 groups participated in the 2004 Genomics Track, making it the track with the most participants in all of TREC 2004. A toal of 145 runs were submitted. For the ad hoc task, there were 47 runs from 27 groups, while for the categorization task, there were 98 runs from 20 groups. The runs of the categorization task were distributed across the subtasks as follows: 59 for the triage subtask, 36 for the annotation hierarchy subtask, and three for the annotation hierarchy plus evidence code subtask.

The data will be available to non-TREC participants in early 2005. A listing of the available data is available.

Introduction

This document contains the final protocol for the TREC 2004 Genomics Track. This protocol had the same general structure as the 2003 track, in that there was a basic information retrieval task and a more experimental text categorization task. We decided, however, to not use the words "primary" and "secondary" as we did in 2003 to describe the two tasks. Instead, we called the two tasks "ad hoc retrieval" and "categorization." The ad hoc retrieval task focused on "regular" document retrieval, while the categorization task focused on classifying documents containing "experimental evidence" allowing assignment of GO codes.

The roadmap for the Genomics Track called for modifying one experimental "facet" each year. For the purposes of the roadmap (based on the NSF grant proposal), last year (2003) was Year 0. This means that 2004 was Year 1. The original plan was to add new types of content in Year 1 and new types of information needs in Year 2. Because we were unable to secure substantial numbers of full text documents for the ad hoc retrieval task in 2004, we decided to reverse the order of the roadmap for Years 1 and 2. This means we focused on new types of information needs for 2004.

There were some resources we made extensive use of in 2004 for the categorization task:

The Gene Ontology (GO) - a controlled terminology used to annotate the function of genes and proteins.
Mouse Genome Informatics (MGI) - the model organism database for the mouse, which, among other things, links literature references to gene function through the use of GO codes.

It is important to have a general understanding of these resources, including the codes used to assign the evidence supporting an annotation using GO. Here are some pointers to get started:

Ad Hoc Retrieval Task

The structure of this task was a conventional searching task based on a 10-year subset of MEDLINE (about 4.5 million documents and 9 gigabytes in size) and 50 topics dervied from information needs obtained via interviews of real biomedical researchers. There was no training data, although sample topics and relevance judgments were available.

Documents

The document collection for this task was a 10-year subset of the MEDLINE bibliographic database of the biomedical literature. MEDLINE can be searched by anyone in the world using the PubMed system of the National Library of Medicine (NLM), which maintains both MEDLINE and PubMed. The full MEDLINE database contains over 13 million references dating back to 1966 and is updated on a daily basis.

The subset of MEDLINE used for the TREC 2004 Genomics Track consisted of 10 years of completed citations from the database inclusive from 1994 to 2003. Records were extracted using the Date Completed (DCOM) field for all references in the range of 19940101 - 20031231. This provided a total of 4,591,008 records, which is about one third of the full MEDLINE database. A gzipped list of the PubMed IDs (PMIDs) is available (10.6 megabytes compressed).

We used the DCOM field and not the Date Published (DP). DCOM and DP are not linked in any formal manner. There are plenty of cases where something is not indexed until quite long after it was published. This may be due to several reasons including the indexer backlog, a new interest in indexing a particular journal - where they have to go back and index the collection, etc.. So this is not an inconsistency or problem with the data.

We pulled the collection using the DCOM field and did not worry about when the articles were published, last revised (LR), or created (DA). We had to pick a date to use to define the boundaries for the 10 years of MEDLINE and the DCOM field was chosen with the thought that it would provide us with a good indicator. The numbers for the complete 4,591,008 collection are as follows:

2,814 ( 0.06%) DPs prior to 1980
8,388 ( 0.18%) DPs prior to 1990
138,384 ( 3.01%) DPs prior to 1994

The remaining 4,452,624 (96.99%) DPs were within the 10 year period of 1994-2004.

The data includes all of the PubMed fields identified in the MEDLINE Baseline record and only the PubMed-centric tags are removed from the XML version. A description of the various fields of MEDLINE are available at:
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#MEDLINEDisplayFormat
The only field at that link not in the subset is the Entrez Date (EDAT) field because that information is not included in the MEDLINE Baseline record. It should also be noted that not all MEDLINE records have abstracts, usually because the article itself does not have an abstract. In general, about 75% of MEDLINE records have abstracts. In our subset, there are 1,209,243 (26.3%) records without abstracts.

The MEDLINE subset is available in two formats:

MEDLINE - ASCII text with fields indicated and delimited by 2-4 character abbreviations (uncompressed - 9,587,370,116 bytes, gzipped - 2,797,589,659 bytes)
XML - NLM XML format (uncompressed - 20,567,278,551 bytes, gzipped - 3,030,576,659 bytes)

Groups contemplating using their own MEDLINE collection or filtering from "live" PubMed should be aware of the following caveats (from Jim Mork of NLM):

Some citations may differ due to revisions or corrections.
Some citations may no longer exist. The bulk of the citations come from the 2004 Baseline; some may have been removed since that baseline was created.
UTF-8 characters have been translated to 7-bit ASCII whereas the live PubMed system provides UTF-8 data and in some cases puts in characters like the inverted question mark.
The XML and ASCII files have been modified to conform to the MEDLINE Baseline format. This is a subtle difference but one that may cause changes in the files.

Topics

The ad hoc retrieval task consisted of 50 topics derived from interviews eliciting information needs of real biologists. We collected a total of 74 information needs statements. These were then honed into topics that represent reasonable kinds of information needs for which a biologist might search the literature and that retrieve a reasonable (i.e., not zero and not thousands) number of relevant documents. The topics are formatted in XML and have the following fields:

ID - 1 to 50
Title - abbreviated statement of information need
Information need - full statement information need
Context - background information to place information need in context

We also created an additional five sample topics to demonstrate what the topics looked like before the official topics were available. These did not have relevance judgments.

Relevance judgments

Relevance judgments were done using the conventional "pooling method" whereby a fixed number of top-ranking documents (depending on the total number of documents and available resources for judging) from each offiicial run are pooled and provided to an individual (blinded to the query statement and participant who they came from) who judges them as definitely relevant (DR), possibly relevant (PR), or not relevant (NR) to the topic. A set of documents were also judged in duplicate to assess interjudge reliability. For the standpoint of the official results, which require binary relevance judgments, documents that were rated DR or PR were considered relevant.

The judgments were done by two individuals with backgrounds in biology. The pools were built from the top-precedence run from each of the 27 groups. We took the top 75 documents for each topic and eliminated the duplicates to create a single pool for each. The average pool size (average number of documents judged per topic) was 976, with a range of 476-1450. The table below shows the total number of judgments and their distribution for the 50 topics.

There are two files of relevance judgments:

04.judgments.txt - a file of all relevance judgments done, based on the format:
<topic><PMID><judgment>
where <judgment> = 1 (DR), 2 (PR), or 3 (NR)

04.qrels.txt - a file of relevant documents, based on the format:
<topic><0><PMID><judgment>

The qrels file is used for input to the trec_eval program to generate recall-precision results. The trec_eval program considers any document with a judgment greater than 0 to be relevant, which in the trecgen04-qrels.txt file is all documents. Those wanting to experiment with DR-only relevance judgments can do so, although only 47 topics can be used (since 3 have no DR documents).

Topic	Total Judgments	Definitely relevant	Possibly relevant	Not relevant	Definitely and probably relevant
1	879	38	41	800	79
2	1264	40	61	1163	101
3	1189	149	32	1008	181
4	1170	12	18	1140	30
5	1171	5	19	1147	24
6	787	41	53	693	94
7	730	56	59	615	115
8	938	76	85	777	161
9	593	103	12	478	115
10	1126	3	1	1122	4
11	742	87	24	631	111
12	810	166	90	554	256
13	1118	5	19	1094	24
14	948	13	8	927	21
15	1111	50	40	1021	90
16	1078	94	53	931	147
17	1150	2	1	1147	3
18	1392	0	1	1391	1
19	1135	0	1	1134	1
20	814	55	61	698	116
21	676	26	54	596	80
22	1085	125	85	875	210
23	915	137	21	757	158
24	952	7	19	926	26
25	1142	6	26	1110	32
26	792	35	12	745	47
27	755	19	10	726	29
28	836	6	7	823	13
29	756	33	10	713	43
30	1082	101	64	917	165
31	877	0	138	739	138
32	1107	441	55	611	496
33	812	30	34	748	64
34	778	1	30	747	31
35	717	253	18	446	271
36	676	164	90	422	254
37	476	138	11	327	149
38	1165	334	89	742	423
39	1350	146	171	1033	317
40	1168	134	143	891	277
41	880	333	249	298	582
42	1005	191	506	308	697
43	739	25	170	544	195
44	1224	485	164	575	649
45	1139	108	48	983	156
46	742	111	86	545	197
47	1450	81	284	1085	365
48	1121	53	102	966	155
49	1100	32	41	1027	73
50	1091	79	223	789	302
Total	48753	4629	3639	40485	8268

As noted above, we performed some overlapping judgments to assess interjudge consistency using Cohen's kappa measure. The table below shows the results of the overlapping judgments. The value obtained for Cohen's kappa was 0.51, indicating "fair" consistency.

Judge 1 / Judge 2	Definitely relevant	Possibly relevant	Not relevant	Total
Definitely relevant	62	35	8	105
Possibly relevant	11	11	5	27
Not relevant	14	57	456	527
Total	87	103	469	659

Results

Recall and precision for the ad hoc retrieval task were calculated in the classic IR way, using the preferred TREC statistic of mean average precision (average precision at each point a relevant document is retrieved, also called MAP). This was done using the standard TREC approach of participants submitting their results in the format for input to Chris Buckley’s trec_eval program. The code for trec_eval is available at ftp://ftp.cs.cornell.edu/pub/smart/ . There are several versions of trec_eval, which differ mainly in the statistics they calculate in their output. We recommend the following version of trec_eval, which should compile with any C compiler: ftp://ftp.cs.cornell.edu/pub/smart/trec_eval.v3beta.shar .

Please note a few "quirks" about trec_eval:

It uses some Unix-specific system calls so would require considerable modification to run on another platform.
The aggregate statistics that are presented at the end of the file (averages for precision at points of recall, average precision, etc.) only include queries for which one or more documents were retrieved. Therefore, you should insert a "dummy" document in your output for queries that retrieve no real documents so that your aggregate scores are averaged over all 50 documents.

The trec_eval program requires two files for input. One file is the topic-document output, sorted by each topic and then subsorted by the order of the IR system output for a given topic. This format is required for official runs submitted to NIST to obtain official scoring.

The topic-document ouptut should be formatted as follows:

1 Q0 12474524 1 5567     tag1
1 Q0 12513833 2 5543     tag1
1 Q0 12517948 3 5000     tag1
1 Q0 12531694 4 2743     tag1
1 Q0 12545156 5 1456     tag1
2 Q0 12101238 1 3.0      tag1
2 Q0 12527917 2 2.7      tag1
3 Q0 11731410 1 .004     tag1
3 Q0 11861293 2 .0003    tag1
3 Q0 11861295 3 .0000001 tag1

where:

The first column is the topic number (1-50).
The second column is the query number within that topic. This is currently unused and must always be Q0.
The third column is the official PubMedID of the retrieved document.
The fourth column is the rank the document is retrieved
The fifth column shows the score (integer or floating point) that generated the ranking. This score MUST be in descending (non-increasing) order. The trec_eval program ranks documents based on the scores, not the ranks in column 4. If a submitter wants the exact ranking submitted to be evaluated, then the SCORES must reflect that ranking.
The sixth column is called the "run tag" and must be a unique identifier across all runs submitted to TREC. Thus, each run tag should have a part that identifies the group and a part that distinguishes runs from that group. Tags are restricted to 12 or fewer letters and numbers, and *NO* punctuation, to facilitate labeling graphs and such with the tags.

The second file required for trec_eval is the relevance judgments, which are called "qrels" in TREC jargon. More information about qrels can be found at http://trec.nist.gov/data/qrels_eng/ . The qrels file is in the following format:

1    0    12474524    1
1    0    12513833    1
1    0    12517948    1
1    0    12531694    1
1    0    12545156    1
2    0    12101238    1
2    0    12527917    1
3    0    11731410    1
3    0    11861293    1
3    0    11861295    1
3    0    12080468    1
3    0    12091359    1
3    0    12127395    1
3    0    12203785    1

where:

The first column is the topic number (1-50).
The second column is always 0.
The third column is the PubMedID of the document.
The fourth column is always 1.

Experiments

Each group was allowed to submit up to two official runs. Each was classified into one of the following two categories:

Automatic - no system or query tuning for the specific topics (systems can be tuned for the sample queries without calling the runs manual)
Manual - with system or query tuning for the specific topics

Categorization Task

A major activity of most model organism database projects is to assign codes from the Gene Ontology (GO) to annotate the function of genes and proteins. The GO consists of three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated (1) biological processes, (2) cellular components, and (3) molecular functions. Each assigned GO code is also designated with a level of evidence indicating the specific experimental evidence in support of assignment of the code. For more information on these, visit the GO Web site .

In the categorization task, we attempted to mimic two of the classification acitivites carried out by human annotators in the mouse genome informatics (MGI) system: a triage task and two variants of MGI's annotation task. Systems were required to classify full-text documents from a two-year span (2002-2003) of three journals. The first year's (2002) documents comprised the training data, while the second year's (2003) documents made up the test data.

Like most tracks within TREC, the Genomics Track operates on the honor system whereby participants follow rules to insure that inappropriate data is not used and results are comparable across systems. In this task, many of the classification answers are obtainable by searching MGI. In addition, data for the annotation task can be used to infer answers for the triage task. It is therefore imperative that users not make inappropriate use of MGI or other data, especially for the test data.

Background

One of the goals of MGI is to provide structured, coded annotation of gene function from the biological literature. Human curators identify genes and assign GO codes about gene function with another code describing the experimental evidence for the GO code. The huge amount of literature to curate creates a challenge for MGI, as their resources are not unlimited. As such, they employ a three-step process to identify the papers most likely to describe gene function:

About mouse - The first step is to identify articles about mouse genomics biology. Articles from several hundred journals are searched for the words mouse, mice, or murine. Articles caught in this "mouse trap" are further analyzed for inclusion in MGI. At present, articles are search in a Web browser one at a time because full-text searching is not available for all of the journals.
Triage - The second step is to determine whether the identified articles should be sent for curation. MGI curates articles not only for GO terms, but also for other aspects of biology, such as gene mapping, gene expression data, phenotype description, and more. The goal of this triage process is to limit the number of articles sent to human curators for more exhaustive analysis. Articles that pass this step go into the MGI system with a tag for GO, mapping, expression, etc.. The rest of the articles do not go into MGI. Our triage task involved correctly classifying which documents have been selected for GO annotation in this process.
Annotation - The third step is the actual curation with GO terms. Curators identify genes for which there is experimental evidence to warrant assignment of GO codes. Those GO codes are assigned, along with a code for each indicating the type of evidence. There can more than one gene assigned GO codes in a given paper and there can be more than one GO code assigned to a gene. In general, and in our collection, there is only one evidence code per GO code assignement per paper. Our annotation task involved a modification of this annotation step as described below.

Documents

The documents for the tasks consisted of articles from three journals over two years, reflecting the full-text data we were able to obtain from Highwire Press. The journals available were Journal of Biological Chemistry (JBC), Journal of Cell Biology (JCB), and Proceedings of the National Academy of Science (PNAS). These journals have a good proportion of mouse genome articles. Each of the papers from these journals was in SGML format. Highwire's DTD and its documentation are available. We designated 2002 articles as training data and 2003 articles as test data. The documents for the tasks came from a subset of these articles that get caught by the "mouse trap" described above. A crosswalk or look-up table was provided that matched an identifier for each Highwire article (its file name) and its corresponding PubMed ID (PMID). The table below shows the total number of articles and the number in the subset the track used.

Journal	2002 papers - total, subset	2003 papers - total, subset	Total papers - total, subset
JBC	6566, 4199	6593, 4282	13159, 8481
JCB	530, 256	715, 359	1245, 615
PNAS	3041, 1382	2888, 1402	5929, 2784
Total papers	10137, 5837	10196, 6043	20333, 11880

The following table lists the the files containing the documents and related data.

File contents	Training data file name	Test data file name
Full-text document collection in Highwire SGML format	train.tar.Z	test.tar.Z
Crosswalk files of PMID, Highwire file name of article, journal code, and year of paper publication	train.crosswalk.txt	test.crosswalk.txt
Positive training examples	triage+train.txt	triage+test.txt

The SGML training document collection is 150 megabytes in size compressed and 449 megabytes uncompressed. The SGML test document collection is 140 megabytes compressed and 397 megabytes uncompressed.

Many gene names have Greek or other non-English characters, which can present a problem for those attempting to recognize gene names in the text. The Highwire SGML appears to obey the rules posted on the NLM Web site with regards to these characters (http://www.ncbi.nlm.nih.gov/entrez/query/static/entities.html).

Triage Task

The goal of this task was to correctly identify which papers were deemed to have experimental evidence warranting annotation with GO codes. Positive examples were papers designated for GO annotation by MGI. However, some of these papers had not yet been annotated. Negative examples were all papers not designated for GO annotation in the operational MGI system. For the training data (2002), there were 375 positive examples, meaning that there were 5837-375 = 5462 negative examples. For the test data (2003), there were 420 positive examples, meaning that there were 6043-420 = 5623 negative examples.

Annotation Task

The primary goal of this task was to correctly identify, given the article and gene name, which of the GO hierarchies (also called domains) have terms within them that have been annotated. Note that the goal of this task was not to select the correct GO term, but rather to select the one or more GO hierarchies (biological process, cellular component, and molecular function, also called domains) from which terms had been selected to annotate the gene for the article. Papers which had been annotated had from one to three hierarchies. There were 328 papers in the collection that had GO terms assigned.

We also included a batch of 558 papers that had a gene name assigned but were used for other purposes by MGI. As such, these papers had no GO annotations. These papers did, however, have one or more gene assigned for the purposes.

A secondary goal of this task was to identify the correct GO evidence code that goes along with the hierarchy code.

Task Data

The figure below shows where the positive and negative examples for the test and training came from in the MGI system.
Data sources

Triage task data

The following files contain the positive examples (negative examples can be obtained by subtracting these documents from the crosswalk files above):

Text file of positive examples for training data
- triage+train.txt
Text file of positive examples for test data
- triage+test.txt

Annotation task data

The table below shows the contents, names, and line counts of the data files for this task. Here is an interpretation of the numbers in the table: For the training data, there are a total of 504 documents that are either positive (one or more GO terms assigned) and negative (no GO terms assigned) examples. From these documents, a total of 1291 genes have been assigned by MGI. (The file gtrain.txt containts the MGI identifier, the gene symbol, and the gene name. It does not contain any other synonyms.) There are 1418 unique document-gene pairs in the training data. The data from the first three rows of the table differ from the rest in that they contain data merged from positive and negative examples. These are what would be used as input for systems to nominate GO domains or the GO domains plus their evidence codes per the annotation task. When the test data are released, these three files are the only ones that will be provided.

For the positive examples in the training data, there are 178 documents and 346 document-gene pairs. There are 589 document-gene name-GO domain tuples (out of a possible 346 * 3 = 1038). There are 640 document-gene name-GO domain-evidence code tuples. A total of 872 GO plus evidence codes have been assigned to these documents.

For the negative examples, there are 326 documents and 1072 document-gene pairs. This means that systems could possibly assign 1072*3 = 3216 document-gene name-GO domain tuples.

File contents	Training data file name	Training data count	Test data file name	Test data count
Documents - PMIDs	ptrain.txt	504	ptest.txt	378
Genes - Gene symbol, MGI identifier, and gene name for all used	gtrain.txt	1294	gtest.txt	777
Document gene pairs - PMID-gene pairs	pgtrain.txt	1418	pgtest.txt	877
Positive examples - PMIDs	p+train.txt	178	p+test.txt	149
Positive examples - PMID-gene pairs	pg+train.txt	346	pg+test.txt	295
Positive examples - PMID-gene-domain tuples	pgd+train.txt	589	pgd+test.txt	495
Positive examples - PMID-gene-domain-evidence tuples	pgde+train.txt	640	pgde+test.txt	522
Positive examples - all PMID-gene-GO-evidence tuples	all+train.txt	872	all+test.txt	693
Negative examples - PMIDs	p-train.txt	326	p-test.txt	229
Negative examples - PMID-gene pairs	pg-train.txt	1072	pg-test.txt	582

The next table shows the actual data in each of the above files. The data are in ASCII format. The field names in are delimited below with vertical bars, but in the files, the field contents are delimited with tab characters.

Training data file name	Data in file
ptrain.txt	PMID
gtrain.txt	Gene symbol\|MGI ID\|Gene name
pgtrain.txt	PMID\|Gene symbol
p+train.txt	PMID
pg+train.txt	PMID\|Gene symbol
pgd+train.txt	PMID\|Gene symbol\|GO domain
pgde+train.txt	PMID\|Gene symbol\|GO domain\|GO evidence code
all+train.txt	PMID\|MGI ID\|Gene symbol\|Gene name\|GO domain\|GO code\|GO name\|GO evidence code
p-train.txt	PMID
pg-train.txt	PMID\|Gene symbol

Examples

Some examples from the data may make the above more clear. Let's start with a positive example and consider the following paper:
Morinobu A, Gadina M, Strober W, Visconti R, Fornace A, Montagna C, Feldman GM, Nishikomori R, O'Shea JJ. STAT4 serine phosphorylation is critical for IL-12-induced IFN-gamma production but not for cell proliferation. Proc Natl Acad Sci U S A. 2002 Sep 17;99(19):12281-6. Epub 2002 Sep 04. PMID: 12213961 [PubMed - indexed for MEDLINE].

The PMID for this paper is 12213961. Using grep, we can look at the instances of this documents in the various files (long lines truncated):

$ grep 12213961 *.txt         
all+train.txt:12213961  MGI:107776      Gadd45b growth arrest and DNA-damage-inA
all+train.txt:12213961  MGI:1346325     Gadd45g growth arrest and DNA-damage-inA
all+train.txt:12213961  MGI:1346325     Gadd45g growth arrest and DNA-damage-inS
all+train.txt:12213961  MGI:1346325     Gadd45g growth arrest and DNA-damage-inS
all+train.txt:12213961  MGI:1346870     Map2k6  mitogen activated protein kinasA
all+train.txt:12213961  MGI:103062      Stat4   signal transducer and activatorA
all+train.txt:12213961  MGI:103062      Stat4   signal transducer and activatorA
all+train.txt:12213961  MGI:103062      Stat4   signal transducer and activatorA
all+train.txt:12213961  MGI:103062      Stat4   signal transducer and activatorA
p+train.txt:12213961
pg+train.txt:12213961   Gadd45b         
pg+train.txt:12213961   Gadd45g         
pg+train.txt:12213961   Map2k6
pg+train.txt:12213961   Stat4 
pgd+train.txt:12213961  Gadd45b BP      
pgd+train.txt:12213961  Gadd45g BP      
pgd+train.txt:12213961  Map2k6  BP      
pgd+train.txt:12213961  Stat4   MF      
pgd+train.txt:12213961  Stat4   CC      
pgd+train.txt:12213961  Stat4   BP      
pgde+train.txt:12213961 Gadd45b BP      IDA       
pgde+train.txt:12213961 Gadd45g BP      IDA       
pgde+train.txt:12213961 Gadd45g BP      TAS       
pgde+train.txt:12213961 Map2k6  BP      IDA       
pgde+train.txt:12213961 Stat4   MF      IDA       
pgde+train.txt:12213961 Stat4   CC      IDA       
pgde+train.txt:12213961 Stat4   BP      IDA       
pgtrain.txt:12213961    Gadd45b         
pgtrain.txt:12213961    Gadd45g         
pgtrain.txt:12213961    Map2k6
pgtrain.txt:12213961    Stat4 
ptrain.txt:12213961 
train.crosswalk.txt:pq1902012281.gml    12213961        PNAS    2002  
triage+train.txt:12213961

From the train.crosswalk.txt file, we can find the full-text Highwire SGML version of the paper, which has the file name pq1902012281.gml (and is in train.tar.Z). This paper has GO codes, so is a positive example for the triage task and appears in the file triage+train.txt. It also is a positive example for the annotation task, so appears in the file of all annotation training papers, ptrain.txt, as well as the file of positive examples, p+train.txt. This paper is annotated with four genes: Gadd45b, Gadd45g, Map2k6, and Stat4. These genes appear with the paper in the files pgtrain.txt and pg+train.txt.

From the file all+train.txt, we can get all of the GO and evidence code assignments for these genes in this paper. There are a total of nine assignments:

12213961	MGI:107776	Gadd45b	growth arrest and DNA-damage-inducible 45 beta	BP	GO:0000186	activation of MAPKK	IDA
12213961	MGI:1346325	Gadd45g	growth arrest and DNA-damage-inducible 45 gamma	BP	GO:0000186	activation of MAPKK	IDA
12213961	MGI:1346325	Gadd45g	growth arrest and DNA-damage-inducible 45 gamma	BP	GO:0042095	interferon-gamma biosynthesis	TAS
12213961	MGI:1346325	Gadd45g	growth arrest and DNA-damage-inducible 45 gamma	BP	GO:0045063	T-helper 1 cell differentiation	TAS
12213961	MGI:1346870	Map2k6	mitogen activated protein kinase kinase 6	BP	GO:0006468	protein amino acid phosphorylation	IDA
12213961	MGI:103062	Stat4	signal transducer and activator of transcription 4	MF	GO:0003677	DNA binding	IDA
12213961	MGI:103062	Stat4	signal transducer and activator of transcription 4	CC	GO:0005634	nucleus	IDA
12213961	MGI:103062	Stat4	signal transducer and activator of transcription 4	BP	GO:0008283	cell proliferation	IDA
12213961	MGI:103062	Stat4	signal transducer and activator of transcription 4	BP	GO:0019221	cytokine and chemokine mediated signaling pathway	IDA

Gadd45b has one assignment, and as such, appears in the files pg+train.txt, pgd+train.txt, and pgde+train.txt only once. Gadd45g has three GO assignments from two different codes in the same hierarchy. It thus appears once in pg+train.txt, once in pgd+train.txt, and twice in pgde+train.txt. The latter situation occurs because even though the two GO codes are in the same hierarchy, they have different evidence codes associated with them. Map2k6 is like Gadd45b, with one assignment and occuring once in the appropriate files. Stat4 has four GO assignments that include all three GO hierarchies. As such, it appears once in pg+train.txt, three times in pgd+train.txt, and three times in pgde+train.txt.

Now let's consider a negative example based on the following paper:
Vivian JL, Chen Y, Yee D, Schneider E, Magnuson T. An allelic series of mutations in Smad2 and Smad4 identified in a genotype-based screen of N-ethyl-N- nitrosourea-mutagenized mouse embryonic stem cells. Proc Natl Acad Sci U S A. 2002 Nov 26;99(24):15542-7. Epub 2002 Nov 13. PMID: 12432092 [PubMed - indexed for MEDLINE]

This paper has PMID 12432092, so we use grep as follows:

$ grep 12432092 *.txt         
p-train.txt:12432092
pg-train.txt:12432092   Smad2 
pg-train.txt:12432092   Smad4 
pgtrain.txt:12432092    Smad2 
pgtrain.txt:12432092    Smad4 
ptrain.txt:12432092 
train.crosswalk.txt:pq2402015542.gml    12432092        PNAS    2002

As can be seen, negative examples are simpler. From the PMID 12432092, we can use the train.crosswalk.txt file to find the full-text Highwire SGML version of the paper, which has the file name pq2402015542.gml (and is also in train.tar.Z). Because the paper is in the training data set, it appears in ptrain.txt. Because it is a negative example, it appears in p-train.txt. The paper does have two genes associated (by MGI) with it, Smad2 and Smad4. These appear in the files of overall article-gene pairs and negative examples of article-gene pairs.

Gene tagging tools

Several research groups graciously allowed their open-source tools for gene tagging to be used for non-commercial purposes on an "as is" basis. These included:

YAGI (Yet Another Gene Identifier) - http://www.cs.wisc.edu/~bsettles/yagi/
LingPipe - http://alias-i.com/lingpipe
AbGene - ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/AbGene

More information about each can be found on their respective Web sites. Alex Morgan of MITRE also maintains a page of BioNLP tools:
http://www.tufts.edu/~amorga02/bcresources.html

Evaluation measures

The framework for evaluation in the categorization task was based on the following table of possibilities:

	Relevant (classified)	Not relevant (not classified)	Total
Retrieved	True positive (TP)	False positive (FP)	All retrieved (AR)
Not retrieved	False negative (FN)	True negative (TN)	All not retrieved (ANR)
	All positive (AP)	All negative (AN)

The consensus of the track was to use a utility measure for the triage task and F measure for the annotation tasks. We developed a Python program that calculated both statistics for each task.

Triage evaluation measures

The measure for the triage task was the utility measure often applied in text categorization research and used by the former TREC Filtering Track. This measure contains coefficients for the utility of retrieving a relevant and retrieving a nonrelevant document. We used a version that was normalized by the best possible score:

U_norm = U_raw / U_max

For a test collection of documents to categorize, U_raw was calculated as follows:

U_raw = (u_r * TP) + (u_nr * FP)

where:

u_r = relative utility of relevant document
u_nr = relative utility of nonrelevant document

We used values for u_r and u_nr that are driven by boundary cases for different results. In particular, we wanted the measure to have the following characteristics:

Completely perfect prediction - U_norm = 1
All documents designated positive (triage everything) - 1 > U_norm > 0
All documents designated negative (triage nothing) - U_norm = 0
Completely imperfect prediction - U_norm < 0

In order to achieve the above boundary cases, we had to set u_r > 1. The ideal approach would have been to interview MGI curators and use decision-theoretic approaches to determine their utilty. However, time did not permit us to do that. Deciding that the triage-everything approach should have a higher score than the triage-nothing approach, we estimated that a U_norm in the range of 0.25-0.3 for the triage-everything condition would be appropriate. Solving for the above boundary cases with U_norm ~ 0.25-0.3 for that case, we obtained a value for u_r ~ 20. To keep calculations simple, we chose a value of u_r = 20. The table below shows the value of U_norm for the boundary cases.

Situation	U_norm - Training	U_norm - Test
Completely perfect prediction	1.0	1.0
Triage everything	0.27	0.33
Triage nothing	0	0
Completely imperfect prediction	-0.73	-0.67

The measure U_max is calculated by assuming all relevant documents are retrieved and no nonrelevant documents are retrieved:

U_max = u_r * AP

(This happens to equal AN.)

Thus, for the training data,

U_raw = (20 * TP) - FP
U_max = 20 * 375 = 7500

U_norm = [(20 * TP) - FP] / 7500

(If you plug in the bounary conditions to the these equations, you should obtain the results specificed above.)

Likewise, for the test data,

U_raw = (20 * TP) - FP
U_max = 20 * 420 = 8400

U_norm = [(20 * TP) - FP] / 8400

Annotation evaluation measures

The measures for the annotation subtasks were based on the notion of identifying tuples of data. Given the article and gene, systems designated one or both of the folloing tuples:

<article, gene, GO hierarchy code>
<article, gene, GO hierarchy code, evidence code>

We employed a global recall, precision, and F measure evaluation measure for each subtask:

Recall = number of tuples correctly identified / number of correct tuples = TP / AP

Precision = number of tuples correctly identified / number of tuples identified = TP / AR
F = (2 * recall * precision) / (recall + precision)

For the training data, the number of correct <article, gene, GO hierarchy code> tuples was 593, while the number of correct <article, gene, GO hierarchy code, evidence code> tuples for the test data was 644.

Submitting results

We developed a Python program that calculates the normalized utility, recall, precision, and F score for all of the subtask submissions. The general format of the files is tab-delimited ASCII, with the first column denoting the subtask (triage, annotation hierarchy, or annotation hierarchy plus evidence code) and the last column containing a run tag, unique for the participant and the run. Below are the formats followed by an example for the specific subtasks:

Triage

triage    <PMID>    tag
triage    12213961  tag

Annotation hierarchy

annhi     <PMID>    <Gene>   <Hierarchy code>  tag
annhi     12213961  Stat4    BP                tag

Annotation hierachy plus evidence

annhiev	  <PMID>    <Gene>   <Hierarchy code>  <Evidence code>   tag
annhiev   12213961  Stat4    BP                IDA               tag

(Submitted results should not include a header, but only data.)

Program to calculate results

The cat_eval program calculates all of the results for the three sub-tasks (even though the official measure for the triage subtask is normalized utility and for the hierarchy annotation subtasks is the F-score). There is a generic version along with a Windows version as an executable. Python 2.3 or above must be installed to run the program. The program has been tested on Windows and Solaris but should run on just about any OS where Python can be installed. Similar to trec_eval, there is very little error-checking for proper data format, so data files should be in the format specified on this page. For any issues, email Ravi Teja Bhupatiraju at bhupatir@ohsu.edu.

Usage: cat_eval.py data_file gold_standard_file

The package comes with three sets of sample files for each of subtask. The files annhi.txt and annhiev.txt are slightly modified versions of pgd+train.txt and pgde+train.txt respectively. The file retrieved.txt is from an actual run.

These are the files for the triage subtask:
Sample Data: retrieved.txt
Gold Standard: triage+train.txt

These are the files for the annotation hierarchy subtask:
Sample Data: annhi.txt
Gold Standard: pgd+train.txt

These are the files for the annotation hierarchy plus evidence subtask:
Sample Data: annhiev.txt
Gold Standard: pgde+train.txt

Here are the formulae used in the program:

Precision = TP / (TP + FP)
Recall = TP / AP
F-score = 2.0 * precision * recall / (precision + recall)
UtilityFactor = 20
Raw Utility = (UtilityFactor * TP) - FP
Max Utility = (UtilityFactor * AP)
Normalized Utility = (UtilityFactor * TP - FP) / (UtilityFactor * AP)

Here is sample output you should get from the triage subtask sample data:

>> cat_eval.py retrieved.txt triage+train.txt
Run: OHSU-TRIAGE-CLASSIFIER-V72
Counts: tp=321; fp=1558; fn=54;
Precision: 0.1708
Recall: 0.8560
F-score: 0.2848
Utility Factor: 20
Raw Utility: 4862
Max Utility: 7500
Normalized Utility: 0.6483

To generate results that are amenable to processing by spreadsheets, statistical programs, and so forth, the command-line tag csv can be added to generate a space-delimited file:

bash-2.05$ ./cat_eval.py retrieved.txt triage+train.txt csv
Run     TP      FP      FN      Precision       Recall  F-Score Utility Factor Raw Utility     Max Utility     Normalized Utility
OHSU-TRIAGE-CLASSIFIER-V72      321     1558    54      0.1708  0.8560  0.2848 20      4862.0  7500.0  0.6483