Update of TREC 2004 Genomics Track Data

This page describes the update of data from the TREC 2004 Genomics Track that was done in February, 2005.  More detail about the protocol for using the data can be found in the 2004 track protocol.  There is also a page that describes all the data for the 2004 track.  To use just the data that has been updated, you can download the file rest.tar.gz from this site, whose contents are described on the data page.  If you did not participate in TREC 2004 and wish to obtain the data for research use, a data usage agreement must be signed.  To obtain the forms and access to the data, visit the data page of the 2004 Genomics Track at the NIST TREC Web site.

Data from the TREC 2004 Genomics Track were updated in early 2005 due to concerns that the MGI GO annotation files had been updated substantially enough since the original "snapshot" taken in the spring of 2004 to change the results of experiments.  Some minor errors in the triage subtask data were also uncovered: four PMIDs were listed in the triage test files but were not officially part of the document collection.

Initial concerns about massive updates by MGI were unfounded.  It turns out that the individual who found the go_refs.mgi file that appeared to be significantly different from our data had misunderstood what was in that file.  So a key point to note up front is that the MGI data have only been incrementally updated, as will be seen from the numbers below.

Updating the data

This section describes the four steps taken to update the data.

1.  Get new files from MGI

Two files were obtained from MGI: a new triage file and the go_refs.mgi file.  We processed these files to obtain the results described below.
This update of the data yielded 22-7+70 = 85 new positive examples to add to our data, an 85/791 = 10.7% increase in the number of positive examples.  Looked at another way, it changed the proportion of positive examples in the entire collection (of 11880 documents) from 795/11880 = 6.7% to 876/11880 = 7.4%.  If the results follow the usual pattern when relevance judgments are revised in IR tasks, this would likely produce modest changes in the scores of individual runs but few if any changes in the performance of runs relative to one another.

2.  Find all PMIDs whose triage for annotation or annotation status has changed

The above two files were processed to identify PMIDs in our two-year, three-journal collection (i.e., those in the crosswalk files).  The table below shows the numbers of PMIDs in the original collections, those added and subtracted, and the sizes of the new files.

Collection | PMIDs in TREC 2004 collections | PMIDs after error correction | PMIDs added from new triage file | PMIDs subtracted from new triage file | PMIDs added from go_refs.mgi | PMIDs in updated triage files
Training   | 375 | 375 |  7 | 6 | 53 | 375+7-6+53=429
Test       | 420 | 416 | 15 | 1 | 17 | 416+15-1+17=447
Both       | 795 | 791 | 22 | 7 | 70 | 876
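The arithmetic in the table above can be double-checked with a short script (the collection names and counts are taken directly from the table):

```python
# Verify the PMID arithmetic from the update table.
# Tuples: (after error correction, added from new triage file,
#          subtracted from new triage file, added from go_refs.mgi)
collections = {
    "train": (375, 7, 6, 53),
    "test": (416, 15, 1, 17),
}

updated = {}
for name, (base, added, subtracted, from_go_refs) in collections.items():
    updated[name] = base + added - subtracted + from_go_refs

print(updated)                # {'train': 429, 'test': 447}
print(sum(updated.values()))  # 876 positive examples in total
```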

3.  Update triage subtask files

We updated the triage subtask files triage+train.txt and triage+test.txt by adding the PMIDs newly designated for GO annotation or newly annotated and subtracting the PMIDs that were "un-triaged."  Recall that the negative examples were everything in the crosswalk.train.txt and crosswalk.test.txt files that was not in the positive-example files.  We then re-ran some experiments to see the impact of the changes, using the cat_eval program to calculate our results.  The new, updated files are named the same as the original files (triage+train.txt and triage+test.txt), and the old files have been renamed to reflect their use in 2004 (triage+train04.txt and triage+test04.txt).
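The file update described above amounts to simple set operations on PMID lists.  A minimal sketch, with an illustrative helper name and made-up PMIDs (the real files hold one PMID per line):

```python
def update_triage(positives, newly_triaged, untriaged):
    """Add newly triaged/annotated PMIDs to the positive examples and
    drop the PMIDs that were "un-triaged".  All arguments are iterables
    of PMID strings; returns the updated positive-example set."""
    return (set(positives) | set(newly_triaged)) - set(untriaged)

# Toy example with made-up PMIDs:
old = ["11111111", "22222222", "33333333"]
added = ["44444444"]
removed = ["22222222"]
print(sorted(update_triage(old, added, removed)))
# ['11111111', '33333333', '44444444']
```

The negative examples then remain implicit: everything in the crosswalk files that is not in the updated positive-example set.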

Our first run repeated the OHSU runs from the 2004 track, which scored at about the 80th percentile (12th of 63 runs).  This effort, led by OHSU team member Aaron Cohen, included reparsing the text files, recreating the chi-square based model from the training set, retraining the classifier on the training set, and applying the classifier to the test data.  Here are the results; the utility score improved slightly from 0.4983 to 0.5465:
Counts: tp=323; tn=4263; fp=1574; fn=124
Precision: 0.1703
Recall: 0.7226
F-score: 0.2756
Accuracy: 0.7298
Utility Factor: 20
Raw Utility: 4886
Max Utility: 8940
Normalized Utility: 0.5465
Sensitivity: 0.7226
Specificity: 0.7303
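The figures above follow directly from the confusion-matrix counts, using the track's utility factor of 20.  The sketch below is not the cat_eval code itself, only a re-derivation of the reported metrics (the function name is ours):

```python
def triage_metrics(tp, tn, fp, fn, utility_factor=20):
    """Recompute the TREC 2004 Genomics triage metrics from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                   # same as sensitivity
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    raw_utility = utility_factor * tp - fp
    max_utility = utility_factor * (tp + fn)  # all positives found, no false positives
    return {
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f_score": round(f_score, 4),
        "accuracy": round(accuracy, 4),
        "raw_utility": raw_utility,
        "max_utility": max_utility,
        "normalized_utility": round(raw_utility / max_utility, 4),
        "specificity": round(tn / (tn + fp), 4),
    }

# Counts from the rescored OHSU run above:
m = triage_metrics(tp=323, tn=4263, fp=1574, fn=124)
print(m["raw_utility"], m["max_utility"], m["normalized_utility"])
# 4886 8940 0.5465
```

Plugging in the counts from the MeSH Mice run below reproduces its figures the same way.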

Naturally, everyone will want to know how the best-performing run, which used only the MeSH term Mice, did.  Here are the results:
Counts: tp=400; tn=3743; fp=2094; fn=47
Precision: 0.1604
Recall: 0.8949
F-score: 0.2720
Accuracy: 0.6593
Utility Factor: 20
Raw Utility: 5906
Max Utility: 8940
Normalized Utility: 0.6606
Sensitivity: 0.8949
Specificity: 0.6413

The MeSH Mice run with the previous data had a utility score of 0.6404.  The new data raised this score to 0.6606, a very modest bump.  The OHSU run did improve more than the Mice run, so it is possible that other runs will also improve more than the Mice run, but the overall results are unlikely to change significantly.

4.  Update annotation hierarchy subtask files

Updating the annotation hierarchy subtask files was easier, although the number of files (20) made the process somewhat more complicated.  We did not update the files of positive examples, since none of those PMIDs turned negative.  (There was no absolute need to update these files, since there is no guarantee they are exhaustive anyway.  All we needed to be sure of was that no PMIDs in the negative examples were misclassified.)  In other words, only articles triaged for annotation were reclassified as non-triaged; those already annotated were not "un-annotated."  We thus merely had to identify which PMIDs in the negative examples were now annotated and therefore needed to be removed.

The files can be divided into three categories:  those containing all the data, those containing positive examples, and those containing negative examples.  See the TREC 2004 Genomics Data Files page for definitions as to what is in these files.

All of the files of negative examples changed due to deletion of PMIDs (and PMID-gene pairs) that were now annotated and hence no longer negative examples.  (We did not add these to the positive example files.)  Here is a list of files with the number of changed lines:
p-train.txt - 41 PMIDs deleted
pg-train.txt - 71 PMID-gene pairs deleted
p-test.txt - 17 PMIDs deleted
pg-test.txt - 27 PMID-gene pairs deleted

Of the files of all examples, only the files containing PMIDs changed; the gene files gtrain.txt and gtest.txt did not change.  Since the files of negative examples have only one line per PMID or PMID-gene pair (unlike the positive examples, which have one line per GO domain, e.g., BP, MF, and CC), the corresponding all-example files had the same number of lines deleted as the negative-example files.  Here is a list of files with the number of changed lines:
ptrain.txt - 41 PMIDs deleted
pgtrain.txt - 71 PMID-gene pairs deleted
ptest.txt - 17 PMIDs deleted
pgtest.txt - 27 PMID-gene pairs deleted
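Mechanically, these deletions are a one-pass filter over each file.  A sketch under the assumption that the PMID is the first whitespace-delimited field on each line (the function name and the set of newly annotated PMIDs are illustrative):

```python
def remove_newly_annotated(in_path, out_path, newly_annotated):
    """Copy a file, dropping lines whose leading PMID is in the
    newly_annotated set.  Works for both one-PMID-per-line files and
    PMID-gene pair files, where the PMID comes first on the line.
    Returns (lines kept, lines dropped)."""
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.split()
            if fields and fields[0] in newly_annotated:
                dropped += 1
                continue
            dst.write(line)
            kept += 1
    return kept, dropped
```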

Although some genes were no longer part of the PMID-gene pairs, we did not modify either of the gene/alias files, gtrain.txt and gtest.txt.  So some of the genes and their aliases listed in these files are no longer part of the collection, but these are just reference files anyway.

None of the files of positive examples changed either.  Here is a list of them for reference:
p+train.txt
pg+train.txt
pgd+train.txt
pgde+train.txt
all+train.txt
p+test.txt
pg+test.txt
pgd+test.txt
pgde+test.txt
all+test.txt

The table below is an update of the contents, names, and line counts of the data files for the annotation hierarchy subtask.  Note that all files whose content changed have retained their original names.  The old files have had "04" added to their names.

Here is an interpretation of the numbers in the table:  For the training data, there are a total of 463 documents that are either positive (one or more GO terms assigned) or negative (no GO terms assigned) examples.  There are 1347 unique document-gene pairs in the training data.  The data from the first three rows of the table differ from the rest in that they contain data merged from positive and negative examples.  These are what would be used as input for systems to nominate GO domains or the GO domains plus their evidence codes per the annotation task.  When the test data are released, these three files are the only ones that will be provided.

For the positive examples in the training data, there are 178 documents and 346 document-gene pairs.  There are 589 document-gene name-GO domain tuples (out of a possible 346 * 3 = 1038).  There are 640 document-gene name-GO domain-evidence code tuples.  A total of 872 GO plus evidence codes have been assigned to these documents.  For the negative examples, there are 285 documents and 1001 document-gene pairs.
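As a quick sanity check, the counts in the preceding two paragraphs tie together (the numbers below are copied from the text):

```python
# Training-data counts from the text above.
train = {"docs_pos": 178, "docs_neg": 285, "pairs_pos": 346, "pairs_neg": 1001}

# Positive + negative examples equal the merged totals.
assert train["docs_pos"] + train["docs_neg"] == 463      # ptrain.txt
assert train["pairs_pos"] + train["pairs_neg"] == 1347   # pgtrain.txt

# At most 3 GO domains (BP, MF, CC) per document-gene pair,
# and every domain tuple carries at least one evidence code.
assert 589 <= train["pairs_pos"] * 3                     # 1038 possible tuples
assert 589 <= 640 <= 872

print("counts check out")
```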

File contents | Training data file name | Training data count | Test data file name | Test data count
Documents - PMIDs | ptrain.txt | 504-41=463 | ptest.txt | 378-17=361
Genes - Gene symbol, MGI identifier, and gene name for all used | gtrain.txt | 1294 | gtest.txt | 777
Document-gene pairs - PMID-gene pairs | pgtrain.txt | 1418-71=1347 | pgtest.txt | 877-27=850
Positive examples - PMIDs | p+train.txt | 178 | p+test.txt | 149
Positive examples - PMID-gene pairs | pg+train.txt | 346 | pg+test.txt | 295
Positive examples - PMID-gene-domain tuples | pgd+train.txt | 589 | pgd+test.txt | 495
Positive examples - PMID-gene-domain-evidence tuples | pgde+train.txt | 640 | pgde+test.txt | 522
Positive examples - all PMID-gene-GO-evidence tuples | all+train.txt | 872 | all+test.txt | 693
Negative examples - PMIDs | p-train.txt | 326-41=285 | p-test.txt | 229-17=212
Negative examples - PMID-gene pairs | pg-train.txt | 1072-71=1001 | pg-test.txt | 582-27=555

Because OHSU did not have any runs in the annotation hierarchy subtask, we were unable to rescore any runs.  Groups that did have runs are encouraged to rescore them and let us know their results.

What to do from here

We realize that having new versions of the data potentially introduces complications into how researchers subsequently carry out and report experiments.  Those running new experiments are encouraged to notify the track chair, and we will consider posting some of them on the Web site.  Papers using these data should explicitly state whether the original (2004) or the new (2005) versions of the data were used.

Last updated - February 28, 2005