Update of TREC 2004 Genomics Track Data
This page describes the update of data from the TREC 2004 Genomics
Track that was done in February, 2005. More detail about the
protocol for using the data can be found in the 2004 track
protocol. There is also a page
that describes all the data for the 2004 track. To use just the
data that has been updated, you can download the file rest.tar.gz from this site, whose contents are
described on the data page. If you did not participate in TREC
2004 and wish to obtain the data for research use, a data usage
agreement must be signed. To obtain the forms and
access to the data, visit the data page of the
2004 Genomics Track at the NIST TREC Web site.
Data from the TREC 2004 Genomics Track were updated in early 2005 because
of concerns that the MGI GO annotation files had been updated
substantially enough since the original "snapshot" of data in the
spring of 2004 to change the results of experiments. Some minor
errors in the triage subtask data were also uncovered: four PMIDs were
listed in the triage test files but were not officially part of the
document collection.
Initial concerns about massive updates by MGI were unfounded. It
turns out that the individual who found the go_refs.mgi
file that appeared to be significantly different from our data had
misunderstood what was in that file. So a key point to note up
front is that the MGI data have only been incrementally updated, as
will be seen from the numbers below.
Updating the data
This section describes the four steps taken to update the data.
1. Get new files from MGI
Two files were obtained from MGI:
- An updated file of PMIDs that had been annotated or triaged for
annotation. The status of PMIDs could change because MGI decided
to annotate an article it had not previously assessed for triage or
because it had decided to not annotate an article after it was
initially triaged for annotation.
- The latest go_refs.mgi file, which lists all PMIDs that have GO
annotations (but does NOT contain a list of those triaged but not yet
annotated). A small number of PMIDs are designated for annotation
outside the triage process, which is why PMIDs not triaged for
annotation could still be annotated (and end up in this file).
We processed these files to obtain the following results:
- Documents newly triaged for annotation from triage file - 22
- Documents previously triaged but now decided not to be annotated
from triage file - 7
- Documents decided to be annotated in go_refs.mgi - 70
This update of the data yielded 22-7+70 = 85 new positive examples to
add to our data. This represented an 85/791 = 10.7% increase in
the number of positive examples. Looked at another way, this
changed the number of positive examples in the entire collection (of
11880 documents) from 795/11880 = 6.7% to 876/11880 = 7.4%. If
the results follow the pattern seen when relevance judgments are
revised in IR tasks, this would likely result in modest changes in the
scores of individual runs but few if any changes in the performance of
runs relative to one another.
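The arithmetic above can be reproduced with a short sketch. The variable names are illustrative and not part of any track tooling; the figures are the ones given in the text.

```python
# Counts reported in the text; variable names are illustrative only.
newly_triaged = 22     # newly triaged for annotation (triage file)
untriaged = 7          # previously triaged, now not to be annotated
newly_annotated = 70   # annotated according to go_refs.mgi

corrected_positives = 791  # positive examples after error correction
collection_size = 11880    # documents in the entire collection

net_new = newly_triaged - untriaged + newly_annotated
updated_positives = corrected_positives + net_new

print(net_new)                                       # 85
print(f"{net_new / corrected_positives:.1%}")        # 10.7%
print(f"{updated_positives / collection_size:.1%}")  # 7.4%
```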
2. Find all PMIDs whose triage for annotation or annotation
status has changed
The above two files were processed to identify PMIDs in our two-year,
three-journal collection (i.e., those in the crosswalk files).
The table below shows the numbers of PMIDs in the original collections,
those added and subtracted, and the sizes of the new files.
Collection | PMIDs in TREC 2004 collections | PMIDs after error correction | PMIDs added from new triage file | PMIDs subtracted from new triage file | PMIDs added from go_refs.mgi | PMIDs in updated triage files
Training | 375 | 375 | 7 | 6 | 53 | 375+7-6+53=429
Test | 420 | 416 | 15 | 1 | 17 | 416+15-1+17=447
Both | 795 | 791 | 22 | 7 | 70 | 876
3. Update triage subtask files
We updated the triage subtask files, triage+train.txt and
triage+test.txt, by adding the PMIDs newly designated for GO annotation
or newly annotated and removing the PMIDs that were "un-triaged."
Recall that the negative examples are everything in the
crosswalk.train.txt and crosswalk.test.txt files that is not in the
positive-example files. We then re-ran some experiments to see the
impact of the changes, using the cat_eval program to calculate our
results. The new, updated files are named the same as the original
files (triage+train.txt and triage+test.txt), and the old files have
been renamed to reflect their use in 2004 (triage+train04.txt and
triage+test04.txt).
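The update described here amounts to a few set operations. The following is a minimal sketch, assuming each file has already been read into a collection of PMID strings; the function names and the PMIDs are made up for illustration, and this is not the script that was actually used.

```python
def update_positives(old_positives, added, removed):
    """Add newly triaged or annotated PMIDs, then drop 'un-triaged' ones."""
    return (set(old_positives) | set(added)) - set(removed)


def negative_examples(crosswalk_pmids, positives):
    """Negatives are the crosswalk PMIDs that are not positive examples."""
    return set(crosswalk_pmids) - set(positives)


# Toy data with made-up PMIDs.
crosswalk = {"1001", "1002", "1003", "1004", "1005"}
old_pos = {"1001", "1002"}
new_pos = update_positives(old_pos, added={"1004"}, removed={"1002"})
new_neg = negative_examples(crosswalk, new_pos)

print(sorted(new_pos))  # ['1001', '1004']
print(sorted(new_neg))  # ['1002', '1003', '1005']
```

Because the negatives are derived from the crosswalk rather than stored separately, only the positive-example files need to be edited.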
Our first run repeated the OHSU runs from the 2004 track, which scored
at about the 80th percentile (12th out of 63). This effort, led
by OHSU team member Aaron Cohen, included reparsing the text files,
recreating the chi-square based model from the training set, retraining
the classifier on the training set, and applying the classifier to the
test data. Here are the results, in which the utility score
improved slightly from 0.4983 to 0.5465:
Counts: tp=323; tn=4263; fp=1574; fn=124
Precision: 0.1703
Recall: 0.7226
F-score: 0.2756
Accuracy: 0.7298
Utility Factor: 20
Raw Utility: 4886
Max Utility: 8940
Normalized Utility: 0.5465
Sensitivity: 0.7226
Specificity: 0.7303
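The figures above can be recomputed from the four counts. The sketch below is not the cat_eval program. It assumes that raw utility is the utility factor times the true positives minus the false positives, and that maximum utility is the utility factor times all positives. These definitions reproduce the reported numbers. The function name is hypothetical.

```python
def triage_metrics(tp, tn, fp, fn, utility_factor=20):
    """Recompute triage scores from confusion-matrix counts (assumed formulas)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    raw_utility = utility_factor * tp - fp
    max_utility = utility_factor * (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "raw_utility": raw_utility,
        "max_utility": max_utility,
        "normalized_utility": raw_utility / max_utility,
        "specificity": tn / (tn + fp),
    }


m = triage_metrics(tp=323, tn=4263, fp=1574, fn=124)
print(m["raw_utility"], m["max_utility"])  # 4886 8940
print(round(m["normalized_utility"], 4))   # 0.5465
```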
Naturally, everyone will want to know how the best performing run that
used only the MeSH term Mice did.
Here are the results:
Counts: tp=400; tn=3743; fp=2094; fn=47
Precision: 0.1604
Recall: 0.8949
F-score: 0.2720
Accuracy: 0.6593
Utility Factor: 20
Raw Utility: 5906
Max Utility: 8940
Normalized Utility: 0.6606
Sensitivity: 0.8949
Specificity: 0.6413
The MeSH Mice run with the
previous data had a utility score of
0.6404. The new data raised this score to 0.6606, a very modest
bump. The OHSU run did improve more than the Mice run, so it
is possible that other runs will also improve more than the Mice run,
but the overall results are unlikely to change significantly.
4. Update annotation hierarchy subtask files
Updating the annotation hierarchy subtask files was easier, although
the number of files (20) made the process somewhat more complicated.
We did not update the files of positive examples, since none of those
PMIDs turned negative. (There was no absolute reason to update these
files, since there is no guarantee they are exhaustive anyway.
All we needed to be sure of was that no PMIDs in the negative examples
were misclassified.) In other words, only articles triaged for
annotation were reclassified as not triaged, and those already
annotated were not "un-annotated." We thus merely had to identify
which PMIDs in the negative examples were now annotated and therefore
needed to be removed.
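Removing the now-annotated PMIDs from a negative-example file can be sketched as a simple line filter. The layout assumed here (whitespace-separated fields with the PMID first) is an assumption for illustration only, as are the function name, PMIDs, and gene names. The actual file formats are described on the data page.

```python
def remove_annotated(lines, annotated_pmids):
    """Drop lines whose first field (assumed to be the PMID) is now annotated."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] in annotated_pmids:
            continue  # no longer a negative example
        kept.append(line)
    return kept


# Toy PMID-gene pairs with made-up identifiers.
negative_pairs = ["2001\tGeneA", "2001\tGeneB", "2002\tGeneC", "2003\tGeneD"]
remaining = remove_annotated(negative_pairs, annotated_pmids={"2001"})

print(len(negative_pairs) - len(remaining))  # 2
```

In the toy data, one PMID accounts for two deleted pairs, which illustrates why the number of deleted PMID-gene pairs can exceed the number of deleted PMIDs.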
The files can be divided into three categories: those
containing all the data, those containing positive examples, and those
containing negative examples. See the TREC
2004 Genomics Data Files page for definitions as to what is in
these files.
All of the files of negative examples changed due to deletion of PMIDs
(and PMID-gene pairs) that were now annotated and hence no longer
negative examples. (We did not add these to the positive example
files.) Here is a list of files with the number of changed lines:
p-train.txt - 41 PMIDs deleted
pg-train.txt - 71 PMID-gene pairs deleted
p-test.txt - 17 PMIDs deleted
pg-test.txt - 27 PMID-gene pairs deleted
Among the files of all examples, only the files containing PMIDs
changed; the gene files gtrain.txt and gtest.txt did not change. Since
the files of negative examples have only one line per PMID or PMID-gene
pair (unlike the positive examples, which have one line for every GO
domain, e.g., BP, MF, and CC), the files of all examples had the same
number of lines deleted as the corresponding files of negative
examples. Here is a list of files with the number of changed lines:
ptrain.txt - 41 PMIDs deleted
pgtrain.txt - 71 PMID-gene pairs deleted
ptest.txt - 17 PMIDs deleted
pgtest.txt - 27 PMID-gene pairs deleted
Although some genes were no longer part of the PMID-gene pairs, we did
not modify either of the gene/alias files, gtrain.txt and gtest.txt.
As a result, some of the genes and their aliases listed in these files
are not part of the collection, but these are just reference files
anyway. None of the files of positive examples changed either. Here is
a list of them for reference:
p+train.txt
pg+train.txt
pgd+train.txt
pgde+train.txt
all+train.txt
p+test.txt
pg+test.txt
pgd+test.txt
pgde+test.txt
all+test.txt
The table below is an update of the contents, names, and line counts of
the data
files for the annotation hierarchy subtask. Note that all files
whose content changed have retained their original names. The old
files have had "04" added to their names.
Here is an interpretation of the numbers in
the table: For the training data, there are a total of 463
documents that are either positive (one or more GO terms assigned) or
negative (no GO terms assigned) examples. There are
1347 unique document-gene pairs in the training data. The data
from the first three rows of the table differ from the rest in that
they contain data merged from positive and negative examples.
These are what would be used as input for systems to nominate GO
domains or the GO domains plus their evidence codes per the annotation
task. When the test data are released, these three files are the
only ones that will be provided.
For the positive examples in the training data, there are 178 documents
and 346 document-gene pairs. There are 589 document-gene name-GO
domain tuples (out of a possible 346 * 3 = 1038). There are 640
document-gene name-GO domain-evidence code tuples. A total of 872
GO plus evidence codes have been assigned to these documents. For
the negative examples, there are 285 documents and 1001 document-gene
pairs.
File contents | Training data file name | Training data count | Test data file name | Test data count
Documents - PMIDs | ptrain.txt | 504-41=463 | ptest.txt | 378-17=361
Genes - Gene symbol, MGI identifier, and gene name for all used | gtrain.txt | 1294 | gtest.txt | 777
Document gene pairs - PMID-gene pairs | pgtrain.txt | 1418-71=1347 | pgtest.txt | 877-27=850
Positive examples - PMIDs | p+train.txt | 178 | p+test.txt | 149
Positive examples - PMID-gene pairs | pg+train.txt | 346 | pg+test.txt | 295
Positive examples - PMID-gene-domain tuples | pgd+train.txt | 589 | pgd+test.txt | 495
Positive examples - PMID-gene-domain-evidence tuples | pgde+train.txt | 640 | pgde+test.txt | 522
Positive examples - all PMID-gene-GO-evidence tuples | all+train.txt | 872 | all+test.txt | 693
Negative examples - PMIDs | p-train.txt | 326-41=285 | p-test.txt | 229-17=212
Negative examples - PMID-gene pairs | pg-train.txt | 1072-71=1001 | pg-test.txt | 582-27=555
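The counts in the table can be cross-checked against one another. This is a minimal sketch using the updated line counts from the table. The dictionary keys are shorthand for the file name prefixes, and the checks encode relationships stated in the text: positives plus negatives equal the merged totals, and there are at most three GO domains per PMID-gene pair.

```python
# Updated line counts from the table; keys are shorthand for file prefixes.
counts = {
    "train": {"p": 463, "pg": 1347, "p+": 178, "pg+": 346,
              "pgd+": 589, "p-": 285, "pg-": 1001},
    "test": {"p": 361, "pg": 850, "p+": 149, "pg+": 295,
             "pgd+": 495, "p-": 212, "pg-": 555},
}

for split, c in counts.items():
    # Merged files are the positive and negative examples combined.
    assert c["p+"] + c["p-"] == c["p"], split
    assert c["pg+"] + c["pg-"] == c["pg"], split
    # At most three GO domains (BP, MF, CC) per positive PMID-gene pair.
    assert c["pgd+"] <= 3 * c["pg+"], split
    print(split, "consistent")
```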
Because OHSU did not have any runs in the annotation subtask, we were
unable to rescore any runs. Any of the groups who did have runs
are encouraged to rescore their runs and let us know their results.
What to do from here
We realize that having new versions of the data potentially introduces
complications into how researchers subsequently carry out and report
experiments. Those running new experiments are encouraged to
notify the track chair, and we will consider posting some of them on
the Web site. Papers using this data should explicitly state
whether the original (2004) or new (2005) versions of the data are used.
Last updated - February 28, 2005