TREC 2006 Genomics Track Tools

Last updated - July 23, 2006

This page describes tools for the 2006 track that allow use of natural language processing (NLP) and other techniques.

NLP Tools

Tools have been submitted by Fabien Campagne of Cornell University and Matt Lease of Brown University. These tools are provided on an "as is" basis and are neither required nor endorsed for use by track participants. The same rules about public release of track data apply (e.g., data cannot be put on public Web sites or shared with those who are not registered for TREC).

Locator tool for mapping sentence IDs and text fragments - Fabien Campagne of Cornell University

We are now providing the locator tool for preview. This tool maps sentence IDs and text fragments to byte counts and lengths (needed for the aspect evaluation measures).

The tool is described in more detail at:
http://chagall.med.cornell.edu/trec-gen/2006/locator.html (login and password same as for the trec genomics track data)

This tool was written by Matt J. Wood (mostly) and myself. This is a preview because we have not yet done our own runs, and when we do we may discover problems that our initial tests did not uncover. Please let us know if you encounter any problems. Feedback is welcome.

The index data file is optional and only needed if you will be using the sentence IDs defined in the processed corpus (http://chagall.med.cornell.edu/trec-gen/2006/). Please note that this file is large (several GB, split into two parts; see the instructions for reassembling).
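As a rough illustration of the kind of mapping involved, the following Python sketch finds a text fragment in the raw bytes of a document and reports the byte position where it starts and its length in bytes. This is not the locator tool itself: the function name locate_fragment is hypothetical, and the sketch assumes the document is UTF-8 encoded and that the fragment occurs verbatim in it.

```python
def locate_fragment(document: bytes, fragment: str, encoding: str = "utf-8"):
    """Return (byte_offset, byte_length) of the first occurrence of
    `fragment` in `document`, or None if it does not occur.

    Hypothetical helper for illustration only; not part of the locator tool.
    """
    # Encode the fragment so the search happens on bytes, not characters.
    needle = fragment.encode(encoding)
    offset = document.find(needle)
    if offset == -1:
        return None
    return offset, len(needle)


if __name__ == "__main__":
    doc = "<p>BRCA1 regulates DNA repair.</p>".encode("utf-8")
    print(locate_fragment(doc, "regulates DNA repair"))
```

Working on bytes rather than characters matters because non-ASCII characters occupy more than one byte in UTF-8, so character positions and byte positions can differ.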

HTML-to-text conversion and sentence boundary detection by Fabien Campagne provided the input sentences for parsing - Matt Lease, Brown University

Sentences were parsed by the Charniak (2000) parser:
Eugene Charniak. "A Maximum-Entropy-Inspired Parser." NAACL'00, pages 132-139.
http://www.cs.brown.edu/people/ec/papers/shortMeP.ps.gz
ftp://ftp.cs.brown.edu/pub/nlparser

The parser was trained on the Penn BioIE treebank (Mining the Bibliome PennBioIE Release 0.9 - http://bioie.ldc.upenn.edu), plus about 4 trees from the GENIA treebank (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/).

The GENIA trees cover numbered lists and were added to improve parser performance on such instances, which do not appear to occur in the BioIE data but are prevalent in the TREC data.

In 10-fold cross-validation, the parser achieves 85.0% PARSEVAL F-measure on the BioIE treebank.
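For readers unfamiliar with the measure, PARSEVAL F-measure is the harmonic mean of labeled bracket precision and recall. The Python sketch below shows the basic calculation for a single sentence. It is a simplification and not the scoring code used to obtain the figure above: the function name and the (label, start, end) bracket representation are assumptions, and conventions of standard scoring tools (such as special handling of punctuation) are not reproduced.

```python
from collections import Counter


def parseval_f1(gold_brackets, predicted_brackets):
    """Labeled bracket F-measure for one sentence.

    Each bracket is a (label, start, end) tuple. Hypothetical helper
    for illustration only.
    """
    gold = Counter(gold_brackets)
    predicted = Counter(predicted_brackets)
    # Multiset intersection: a bracket matches at most as many times
    # as it appears in both the gold and the predicted parse.
    matched = sum((gold & predicted).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(predicted.values())
    recall = matched / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
    predicted = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
    print(parseval_f1(gold, predicted))
```

In the example, three of the four predicted brackets match the gold parse, so precision and recall are both 3/4.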

Constituency parses were converted to dependency parses using ptbconv - http://www.jaist.ac.jp/~h-yamada

Mapping of documents into sentences - Martijn Schuemie, Erasmus University Medical Center Rotterdam

The full document set can now be downloaded from www.biosemantics.org/TREC2006. Username and password are the same as for the original data.

Normalized representation of the TREC questions - Alex Morgan, Stanford and MITRE

These were prepared to enable their group to do a "manual" submission. The questions are normalized in the sense that they are linked to resources like EntrezGene and UMLS, which makes it possible to do clever things like term expansion and/or use of the terminology hierarchy.

The topics are presented in two forms, an Excel spreadsheet:
http://www.stanford.edu/~alexmo/slides/NormalizedTRECGen2006Questions.xls
and a tab-delimited text file:
http://www.stanford.edu/~alexmo/slides/NormalizedTRECGen2006Questions.txt
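The tab-delimited form can be read with standard tools. A minimal Python sketch follows, assuming only that each line holds fields separated by tabs; the column layout of the actual file is not described here, so the sample data and the function name read_tab_delimited are purely hypothetical.

```python
import csv
import io


def read_tab_delimited(handle):
    """Read tab-separated rows from an open text file handle.

    Blank lines are skipped. Hypothetical helper for illustration only.
    """
    reader = csv.reader(handle, delimiter="\t")
    return [row for row in reader if row]


if __name__ == "__main__":
    # Made-up sample data; the real file's columns may differ.
    sample = io.StringIO("field_a\tfield_b\n\nvalue_1\tvalue_2\n")
    for row in read_tab_delimited(sample):
        print(row)
```

To read a downloaded copy, open it with open(path, newline="", encoding=...) and pass the handle to the function; the correct encoding for the file is not stated here and would need to be checked.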