TREC 2006 Genomics Track Tools
Last updated - July 23, 2006
This page describes tools for the 2006 track that allow use of natural
language processing (NLP) and other techniques.
NLP Tools
Tools have been submitted by Fabien Campagne of Cornell University and
Matt Lease of Brown University. These tools are provided on an "as is"
basis and not required or endorsed for use by track participants. The
same rules about public release of track data (e.g., data cannot be put
on public Web sites or shared with those who are not registered for
TREC) apply.
Locator tool for mapping sentence IDs and text fragments - Fabien
Campagne of Cornell University
We are now providing the locator tool for preview. This tool maps
sentence IDs and text fragments to byte counts and lengths (needed for
the aspect evaluation measures).
The tool is described in more detail at:
http://chagall.med.cornell.edu/trec-gen/2006/locator.html
(The login and password are the same as for the TREC Genomics Track data.)
This tool was written by Matt J. Wood (mostly) and me. It is a
preview because we have not done our runs yet, and when we do we may
discover problems that our initial tests did not uncover.
Please let us know if you encounter any problems. Feedback is welcome.
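As a rough illustration of the mapping described above (not the locator tool's actual code), the following Python sketch finds a text fragment in a document's raw bytes and reports where it starts and how long it is. The function name `locate_fragment`, the UTF-8 encoding, and the sample document are assumptions made for this example only.

```python
def locate_fragment(document: bytes, fragment: str, encoding: str = "utf-8"):
    """Return (byte_offset, byte_length) of the first occurrence of
    `fragment` in `document`, or None if it does not occur.

    Hypothetical helper for illustration; the real locator tool may work
    quite differently.
    """
    needle = fragment.encode(encoding)
    offset = document.find(needle)
    if offset < 0:
        return None
    return offset, len(needle)


# Made-up sample document (not taken from the track corpus).
doc = "<p>BRCA1 interacts with RAD51.</p>".encode("utf-8")
span = locate_fragment(doc, "interacts with RAD51")
```

This only works when the fragment matches the raw bytes exactly. Fragments taken from processed text may not match the original HTML byte for byte, so treat this as a sketch of the idea rather than a replacement for the tool.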
The index data file is optional and only needed if you will be using
the sentence IDs defined in the processed corpus (http://chagall.med.cornell.edu/trec-gen/2006/).
Please note that this file is large (several GB, split into two parts;
see the instructions for reassembling) and is not required if you do not
need to convert sentence IDs.
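The authoritative steps for reassembling the index file are the tool's own instructions. If the two parts were produced by plain byte-wise splitting (an assumption, not something stated here), joining them amounts to concatenating the parts in order, as in this Python sketch. The function name `reassemble` and the file names in the demo are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def reassemble(parts, output):
    """Concatenate the part files, in the order given, into `output`.

    Assumes the parts are plain byte-wise slices of the original file.
    Returns the size of the joined file in bytes.
    """
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Streams in chunks, so multi-gigabyte parts do not need
                # to fit in memory.
                shutil.copyfileobj(src, out)
    return Path(output).stat().st_size


# Demo on two tiny temporary files with made-up names.
with tempfile.TemporaryDirectory() as tmp:
    part_a = Path(tmp) / "index.part1"
    part_b = Path(tmp) / "index.part2"
    part_a.write_bytes(b"first half, ")
    part_b.write_bytes(b"second half")
    joined = Path(tmp) / "index.data"
    size = reassemble([part_a, part_b], joined)
    content = joined.read_bytes()
```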
HTML->text conversion and sentence boundary detection by
Fabien Campagne provided the input sentences for parsing - Matt Lease,
Brown University
Sentences were parsed by the Charniak (2000) parser:
Eugene Charniak. "A Maximum-Entropy-Inspired Parser." NAACL'00, pages
132-139.
http://www.cs.brown.edu/people/ec/papers/shortMeP.ps.gz
ftp://ftp.cs.brown.edu/pub/nlparser
The parser was trained on the PennBioIE treebank (Mining the Bibliome
PennBioIE Release 0.9 - http://bioie.ldc.upenn.edu), plus about 4 trees
from the GENIA treebank (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/)
that cover numbered lists. Such cases do not appear to occur in the
BioIE data but are prevalent in the TREC data, so these trees were added
to improve parser performance on such instances.
In 10-fold cross-validation, the parser achieves 85.0% PARSEVAL
F-measure on the BioIE treebank.
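For readers unfamiliar with the measure, PARSEVAL scores a parse by comparing its labeled brackets (constituent label plus word span) against the gold-standard tree. The Python sketch below computes precision, recall, and F-measure from two bracket lists. The function name `parseval` and the example brackets are made up for illustration, and the usual scoring conventions (for example, how punctuation is handled) are not reproduced, since the settings behind the figure above are not stated here.

```python
from collections import Counter


def parseval(gold_brackets, test_brackets):
    """Return (precision, recall, f_measure) over labeled brackets.

    Each bracket is a (label, start, end) tuple. Counters are used so
    that repeated identical brackets are matched one-to-one.
    """
    gold = Counter(gold_brackets)
    test = Counter(test_brackets)
    matched = sum((gold & test).values())
    precision = matched / sum(test.values()) if test else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: the parser mislabels one of four constituents.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
test = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
p, r, f = parseval(gold, test)
```

In this toy case three of the four brackets match in each direction, so precision, recall, and F-measure are all 0.75.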
Constituency parses were converted to dependency parses using ptbconv:
- http://www.jaist.ac.jp/~h-yamada
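A common way to turn a constituency parse into a dependency parse is to pick a head child for each phrase and attach the heads of the other children to it. The Python sketch below shows that idea on a toy tree. The rule table `HEAD_RULES`, the tree encoding, and the function names are assumptions made for this example; they are not ptbconv's rules or interface.

```python
# Toy head rules (an assumption, not ptbconv's table): for each phrase
# label, the scan direction and the child labels to prefer, in order.
HEAD_RULES = {
    "S": ("left", ["VP", "S"]),
    "VP": ("left", ["VBZ", "VBD", "VBP", "VB", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
    "PP": ("left", ["IN"]),
}


def pick_head_index(label, children):
    """Return the position of the head child among `children`."""
    direction, preferred = HEAD_RULES.get(label, ("left", []))
    indices = list(range(len(children)))
    if direction == "right":
        indices.reverse()
    for wanted in preferred:
        for i in indices:
            if children[i][0] == wanted:
                return i
    return indices[0]  # fall back to the first child in scan order


def to_dependencies(tree):
    """Convert a tree into (words, root_index, dependencies).

    A leaf is (tag, word); an internal node is (label, [children]).
    Each dependency is a (dependent_index, head_index) pair.
    """
    words, deps = [], []

    def visit(node):
        label, body = node
        if isinstance(body, str):  # leaf: record the word
            words.append(body)
            return len(words) - 1
        heads = [visit(child) for child in body]
        head = heads[pick_head_index(label, body)]
        for h in heads:
            if h != head:
                deps.append((h, head))
        return head

    root = visit(tree)
    return words, root, sorted(deps)


tree = ("S", [
    ("NP", [("DT", "The"), ("NN", "protein")]),
    ("VP", [("VBZ", "binds"), ("NP", [("NN", "DNA")])]),
])
words, root, deps = to_dependencies(tree)
```

For this tree the verb "binds" becomes the root, "protein" and "DNA" attach to it, and "The" attaches to "protein".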
Mapping of documents into sentences - Martijn Schuemie, Erasmus
University Medical Center Rotterdam
The full document set can now be downloaded from www.biosemantics.org/TREC2006.
Username and password are the same as for the original data.
Normalized representation of the TREC questions - Alex Morgan,
Stanford and MITRE
These were produced to enable their group to do a "manual" submission.
The topics are normalized in the sense that they are linked to resources
such as EntrezGene and UMLS, which makes it possible to do clever things
like term expansion and/or use of the terminology hierarchy.
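To make the idea of term expansion concrete, here is a minimal Python sketch that expands a query term with synonyms and narrower terms. The two small tables are hand-written stand-ins for lookups in resources such as EntrezGene or UMLS; they are not extracted from those resources or from the normalized topic files, and the function name `expand` is hypothetical.

```python
# Hand-written toy tables, for illustration only.
SYNONYMS = {
    "PRNP": ["PrP", "prion protein"],
}
NARROWER = {
    "prion diseases": ["scrapie", "Creutzfeldt-Jakob disease"],
}


def expand(term):
    """Return `term` followed by its synonyms and narrower terms.

    Related terms are followed transitively, and each term appears
    once, in the order it was first reached.
    """
    expanded = [term]
    seen = {term}
    queue = [term]
    while queue:
        current = queue.pop(0)
        related_terms = SYNONYMS.get(current, []) + NARROWER.get(current, [])
        for related in related_terms:
            if related not in seen:
                seen.add(related)
                expanded.append(related)
                queue.append(related)
    return expanded


gene_terms = expand("PRNP")
disease_terms = expand("prion diseases")
```

A retrieval system could then search for any of the expanded terms instead of only the original one.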
The topics are presented in two forms, an Excel spreadsheet:
http://www.stanford.edu/~alexmo/slides/NormalizedTRECGen2006Questions.xls
and a tab-delimited text file:
http://www.stanford.edu/~alexmo/slides/NormalizedTRECGen2006Questions.txt
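The tab-delimited version can be read with standard tools. The Python sketch below parses a tab-delimited stream into rows. The column layout of the real file is not described here, so the sample header and row are placeholders, and `read_tab_delimited` is a hypothetical helper name.

```python
import csv
import io


def read_tab_delimited(handle):
    """Return the non-empty rows of a tab-delimited stream as lists of strings."""
    # QUOTE_NONE keeps any quotation marks in the fields as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


# Placeholder content; the real file's columns may differ.
sample = io.StringIO("topic_id\tquestion\n160\tplaceholder question text\n")
rows = read_tab_delimited(sample)
```

With a downloaded copy, the same function can be applied to a file opened with `open(path, newline="")`. The file's character encoding is not stated, so it may need to be specified explicitly.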