TREC 2006 Genomics Track Data and Tools

Last updated - April 6, 2015

This page lists the files that are in the data distribution and the evaluation tools for the TREC 2006 Genomics Track. Additional detail about these files can be found in the 2006 track protocol. The data files themselves are available.

This page provides information about the following categories of data and tools for the TREC 2006 Genomics Track: documents, topics, relevance judgments, and evaluation tools.

Documents

The document collection for the TREC 2006 Genomics Track consists of full-text HTML documents from 49 journals that publish electronically via Highwire Press and that granted permission for research use of their articles in the Genomics Track.  The documents were obtained by a Web crawl of the Highwire site, with postprocessing to eliminate as much non-article material as we could.  Please note that we promised the publishers that this material would only be used for research purposes and would not be posted on public Web sites.

The full collection contains 162,259 documents from the 49 journals.  The files are available on the protected portion of the track Web site in WinZip format.  There are 59 .zip files, one for each journal, with the exception of the large Journal of Biological Chemistry, which has its data in 11 files (one for each year from 1995 to 2005, as listed in the table below).  The 59 .zip files total about 3 GB in size.  The full collection is about 12.3 GB when uncompressed.

There is also a text file, metadata.txt (Windows ASCII format, 11.9 MB), which lists the original URL of each article, its file name in our collection, and its file size in kilobytes.  Note that the name of each document file is its PMID plus the extension .html, which facilitates accessing the associated MEDLINE record as described below.  Here is a sample of the text file:
http://www.rnajournal.org/cgi/content/full/9/1/1     12554869.html    73.4
http://www.rnajournal.org/cgi/content/full/9/1/9    12554870.html    46.43
http://www.rnajournal.org/cgi/content/full/9/1/14    12554871.html    38.17
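
The sample lines above can be read with a few lines of Python. This is only a minimal sketch: it assumes the three fields are separated by runs of whitespace (as they appear in the sample), and the names MetadataRow and parse_metadata_line are illustrative, not part of the track distribution.

```python
from typing import NamedTuple


class MetadataRow(NamedTuple):
    url: str        # original URL of the article
    filename: str   # file name in the collection, e.g. 12554869.html
    size_kb: float  # file size in kilobytes
    pmid: str       # PMID derived from the file name


def parse_metadata_line(line: str) -> MetadataRow:
    """Parse one line of metadata.txt (assumed whitespace-separated)."""
    # split() with no argument also discards the trailing "\r\n"
    # of a Windows-format text file.
    url, filename, size = line.split()
    # The document file name is the PMID plus the extension .html.
    pmid = filename[:-len(".html")] if filename.endswith(".html") else filename
    return MetadataRow(url, filename, float(size), pmid)


row = parse_metadata_line(
    "http://www.rnajournal.org/cgi/content/full/9/1/1     12554869.html    73.4\r\n"
)
print(row.pmid, row.size_kb)
```

Deriving the PMID from the file name is what links each full-text document to its MEDLINE record.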
The table below lists the .zip file name, size in MB, number of documents, and journal URL for each file in our collection.
Journal Name | File Name | File Size (MB) | Number of Docs | Journal URL
American Journal of Epidemiology | ajepidem.zip | 24 | 1777 | aje.oxfordjournals.org
American Journal of Physiology - Cell Physiology | ajpcell.zip | 62 | 2906 | ajpcell.physiology.org
American Journal of Physiology - Endocrinology And Metabolism | ajpendometa.zip | 48 | 2462 | ajpendo.physiology.org
American Journal of Physiology - Gastrointestinal and Liver Physiology | ajpgastro.zip | 48 | 2472 | ajpgi.physiology.org
American Journal of Physiology - Heart and Circulatory Physiology | ajpheart.zip | 99 | 5170 | ajpheart.physiology.org
American Journal of Physiology - Lung Cellular and Molecular Physiology | ajplung.zip | 48 | 2426 | ajplung.physiology.org
American Journal of Physiology - Renal Physiology | ajprenal.zip | 39 | 1897 | ajprenal.physiology.org
Alcohol and Alcoholism | alcohol.zip | 9.7 | 657 | alcalc.oxfordjournals.org
Journal of Andrology | andrology.zip | 7.1 | 482 | www.andrologyjournal.org
Annals of Oncology | annonc.zip | 16 | 1273 | annonc.oxfordjournals.org
British Journal of Anaesthesia | bjanast.zip | 21 | 1843 | bja.oxfordjournals.org
The British Journal of Psychiatry | bjp.zip | 17 | 1531 | bjp.rcpsych.org
Blood | blood.zip | 209 | 11291 | www.bloodjournal.org
Carcinogenesis | carcinogenesis.zip | 36 | 2022 | carcin.oxfordjournals.org
Cerebral Cortex | cercor.zip | 22 | 917 | cercor.oxfordjournals.org
Development | development.zip | 62 | 2402 | dev.biologists.org
Diabetes | diabetes.zip | 37 | 2156 | diabetes.diabetesjournals.org
Endocrinology | endocrinology.zip | 104 | 5517 | endo.endojournals.org
European Heart Journal | euroheartj.zip | 15 | 1160 | eurheartj.oxfordjournals.org
Glycobiology | glycobiology.zip | 15 | 719 | glycob.oxfordjournals.org
Human Reproduction | humanrep.zip | 50 | 3784 | humrep.oxfordjournals.org
Human Molecular Genetics | humolgen.zip | 58 | 3105 | hmg.oxfordjournals.org
International Journal of Epidemiology | ijepidem.zip | 13 | 1203 | ije.oxfordjournals.org
International Immunology | intimm.zip | 23 | 1175 | intimm.oxfordjournals.org
Journal of Antimicrobial Chemotherapy | jantichemo.zip | 29 | 2720 | jac.oxfordjournals.org
Journal of Applied Physiology | jappliedphysio.zip | 105 | 5751 | jap.physiology.org
Journal of Biological Chemistry | jbc-1995.zip | 74 | 4368 | www.jbc.org
Journal of Biological Chemistry | jbc-1996.zip | 33 | 4733 | www.jbc.org
Journal of Biological Chemistry | jbc-1997.zip | 60 | 3098 | www.jbc.org
Journal of Biological Chemistry | jbc-1998.zip | 59 | 2918 | www.jbc.org
Journal of Biological Chemistry | jbc-1999.zip | 49 | 2432 | www.jbc.org
Journal of Biological Chemistry | jbc-2000.zip | 111 | 5361 | www.jbc.org
Journal of Biological Chemistry | jbc-2001.zip | 69 | 3262 | www.jbc.org
Journal of Biological Chemistry | jbc-2002.zip | 119 | 5539 | www.jbc.org
Journal of Biological Chemistry | jbc-2003.zip | 76 | 3510 | www.jbc.org
Journal of Biological Chemistry | jbc-2004.zip | 132 | 6214 | www.jbc.org
Journal of Biological Chemistry | jbc-2005.zip | 109 | 4886 | www.jbc.org
The Journal of Cell Biology | jcb.zip | 93 | 3996 | www.jcb.org
Journal of Clinical Endocrinology & Metabolism | jclinicalendometa.zip | 6.9 | 758 | jcem.endojournals.org
Journal of Cell Science | jcs.zip | 54 | 2417 | jcs.biologists.org
Journal of Experimental Biology | jexpbio.zip | 41 | 1911 | jeb.biologists.org
Journal of Experimental Medicine | jexpmed.zip | 70 | 3492 | www.jem.org
The Journal of General Physiology | jgenphysio.zip | 25 | 1014 | www.jgp.org
Journal of General Virology | jgenviro.zip | 40 | 2375 | vir.sgmjournals.org
Journal of Histochemistry and Cytochemistry | jhistocyto.zip | 24 | 1592 | www.jhc.org
Journal of the National Cancer Institute | jnci.zip | 34 | 3214 | jncicancerspectrum.oxfordjournals.org
Journal of Neurophysiology | jneuro.zip | 68 | 2874 | jn.physiology.org
Molecular & Cellular Proteomics | mcp.zip | 9.5 | 426 | www.mcponline.org
Microbiology | microbio.zip | 46 | 2400 | mic.sgmjournals.org
Molecular Biology and Evolution | molbiolevol.zip | 25 | 1303 | mbe.oxfordjournals.org
Molecular Endocrinology | molendo.zip | 36 | 1610 | mend.endojournals.org
Molecular Human Reproduction | molhumanrep.zip | 14 | 817 | molehr.oxfordjournals.org
Nucleic Acids Research | nar.zip | 126 | 7606 | nar.oxfordjournals.org
Nephrology Dialysis Transplantation | nephrodiatransp.zip | 38 | 3629 | ndt.oxfordjournals.org
Protein Engineering Design and Selection | peds.zip | 15 | 834 | peds.oxfordjournals.org
Physiological Genomics | physiogenomics.zip | 13 | 656 | physiolgenomics.physiology.org
Rheumatology | rheumatolgy.zip | 21 | 1985 | rheumatology.oxfordjournals.org
RNA | rna.zip | 11 | 544 | www.rnajournal.org
Toxicological Sciences | toxsci.zip | 33 | 1667 | toxsci.oxfordjournals.org

We have discovered some issues with the full-text collection.  One of these was fixed:  the original files for the journal Blood were found to be lacking titles and author names and affiliations.  We have corrected this and uploaded a new .zip file (time-stamped 5-20-06), renaming the old one to blood-old.zip.  We have also found the following other issues, for which we do not plan fixes:

In addition to the full-text data, the NLM has been kind enough to provide us with both ASCII and XML format collections of all the MEDLINE references for the full-text documents in our Highwire collection.

There are a couple of caveats about this data.  First, we have identified 1,767 instances (about 1% of the 162K documents) where the Highwire file PMID is invalid.  We investigated the problem and found that, for all of the instances we checked, the original Highwire file had an incorrect PMID.  In other words, the error is in the Highwire data, not a result of our processing.  For this reason, we have decided not to delete these files from the collection.  They represent, in our view, normal dirty data.  Second, there are also likely to be incorrect PMIDs in the Highwire data that happen to map to valid PMIDs.  Below is a table of the number of erroneous PMIDs we identified per journal.

Journal | PMID Errors
American Journal of Epidemiology | 85
American Journal of Physiology - Cell Physiology | 5
American Journal of Physiology - Endocrinology And Metabolism | 3
American Journal of Physiology - Gastrointestinal and Liver Physiology | 2
American Journal of Physiology - Heart and Circulatory Physiology | 13
American Journal of Physiology - Lung Cellular and Molecular Physiology | 7
American Journal of Physiology - Renal Physiology | 7
Alcohol and Alcoholism | 22
Journal of Andrology | 26
Annals of Oncology | 33
British Journal of Anaesthesia | 13
The British Journal of Psychiatry | 53
Blood | 53
Carcinogenesis | 25
Cerebral Cortex | 10
Development | 36
Diabetes | 2
Endocrinology | 35
European Heart Journal | 5
Glycobiology | 1
Human Reproduction | 86
Human Molecular Genetics | 15
International Journal of Epidemiology | 83
International Immunology | 2
Journal of Antimicrobial Chemotherapy | 21
Journal of Applied Physiology | 231
Journal of Biological Chemistry | 74
The Journal of Cell Biology | 48
Journal of Clinical Endocrinology & Metabolism | 68
Journal of Cell Science | 40
Journal of Experimental Biology | 30
Journal of Experimental Medicine | 24
The Journal of General Physiology | 14
Journal of General Virology | 2
Journal of Histochemistry and Cytochemistry | 1
Journal of the National Cancer Institute | 238
Journal of Neurophysiology | 0
Molecular & Cellular Proteomics | 10
Microbiology | 10
Molecular Biology and Evolution | 12
Molecular Endocrinology | 49
Molecular Human Reproduction | 17
Nucleic Acids Research | 19
Nephrology Dialysis Transplantation | 101
Protein Engineering Design and Selection | 3
Physiological Genomics | 9
Rheumatology | 122
RNA | 1
Toxicological Sciences | 21

The MEDLINE files are in the protected area of the Web site as follows:
The same restrictions apply to the NLM data as to the Highwire data:  If you have signed the TREC usage agreement, you are free to use the data in any way, but it must not be posted on any public Web site.

Another file in the protected area of the Web site is legalspans.txt.  This file contains all "legal spans" for all documents in the collection.  A legal span is any run of text more than 0 bytes long that lies between two consecutive HTML paragraph tags, where a paragraph tag is any tag that starts with <p or </p.  There are a total of 12,641,127 legal spans in the collection.  We will use these spans to define allowed passages in the pooling and evaluation process, and you may find them useful in other ways.  The file has the format:
pmid<tab>span-start<tab>span-length
Note that the first and last span (and some in between) may contain a great deal of text that would not be a true passage, e.g., HTML <head> and Javascript.  Here is an example of the legalspans.txt file from document 15485830.html:
15485830	0	403
15485830 406 133
15485830 542 6
15485830 551 88
15485830 642 3660
15485830 4305 3828
15485830 8136 16
15485830 8155 2923
15485830 11081 1486
15485830 12570 2453
15485830 15026 1548
15485830 16577 3237
15485830 19817 814
15485830 20634 829
15485830 21466 701
15485830 22170 989
15485830 23162 1433
15485830 24598 840
15485830 25441 4191
15485830 29635 16
15485830 29654 3347
15485830 33004 81
15485830 33088 1961
15485830 35052 1949
15485830 37004 1745
15485830 38752 16
15485830 38771 1469
15485830 40243 2338
15485830 42584 16
15485830 42603 4528
15485830 47134 1191
15485830 48328 786
15485830 49117 2549
15485830 51669 1945
15485830 53617 16
15485830 53636 1607
15485830 55246 2798
15485830 58047 16
15485830 58066 2157
15485830 60226 3324
15485830 63553 3722
15485830 67278 3022
15485830 70303 2977
15485830 73283 1808
15485830 75094 734
15485830 75831 1036
15485830 76870 603
15485830 77476 831
15485830 78310 529
15485830 78842 20417
15485830 99262 16
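
The definition of legal spans can be illustrated with a short Python sketch. It is not the code used to build legalspans.txt; it assumes that offsets are byte offsets into the raw HTML file, that tag matching is case-insensitive, and that the text before the first paragraph tag and after the last one also counts as a span (which is consistent with the example above, where the first span starts at offset 0 and consecutive spans are separated by three bytes, the length of a <p> tag). The names PARA_TAG and legal_spans are illustrative.

```python
import re

# Any tag that starts with "<p" or "</p", taken literally from the
# definition above (so this also matches tags such as <pre>).
PARA_TAG = re.compile(rb"</?p[^>]*>", re.IGNORECASE)


def legal_spans(html: bytes):
    """Return (start, length) pairs for the text between paragraph tags."""
    spans = []
    pos = 0  # byte offset just past the previous paragraph tag
    for match in PARA_TAG.finditer(html):
        if match.start() > pos:  # keep only spans longer than 0 bytes
            spans.append((pos, match.start() - pos))
        pos = match.end()
    if len(html) > pos:  # text after the last paragraph tag
        spans.append((pos, len(html) - pos))
    return spans


def format_spans(pmid: str, html: bytes):
    """Format spans as lines of pmid<tab>span-start<tab>span-length."""
    return [f"{pmid}\t{start}\t{length}" for start, length in legal_spans(html)]


sample = b"head<p>one</p><p>two<p>tail"
print(legal_spans(sample))
```

In the small example, the empty stretch between </p> and the following <p> produces no span, because only text longer than 0 bytes qualifies.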

Topics

There are 28 official topics, with seven topics from each of last year's generic topic templates (GTTs) used this year.  The topics are available in several forms:
Please note that columns B-D of the spreadsheet represent last year's topics from which this year's questions are derived.  They are not part of the official question, and you should not assume that what is in those columns will in any way impact the relevance judgments.  They are there solely to show where the questions were derived from.  The official questions are in column E of the spreadsheet and are reproduced in the 2006topics.txt file.

Sample passages and aspects for two topics are posted both in HTML and as a spreadsheet.  These should not be considered training data; they are examples of what passages and aspects will look like in the final data.  Note also that the PMIDs in these samples are not in the actual document collection.

Relevance Judgments

The file final.goldstd.tsv.txt contains gold standard correct passages, along with topicid and aspect MeSH terms, in this format:
<topicid> <pubmedid> <offset> <length> <MeSH aspects, separated by '|' characters>
The file trec2006.raw.relevance.tsv.txt contains the raw relevance judgments made on the pooled maximal-length spans. The fields are tab-separated and in this order:
<topicid> <pubmedid> <offset> <length> <spanid> <relevance>
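
A gold standard line in the format above could be read as follows. This is a hedged sketch rather than the track's own code: it assumes the gold standard file is tab-separated (as its .tsv name suggests), and the names GoldPassage and parse_gold_line, as well as the values in the usage line, are made up for illustration.

```python
from typing import List, NamedTuple


class GoldPassage(NamedTuple):
    topicid: str
    pmid: str
    offset: int         # start of the passage within the document
    length: int         # length of the passage
    aspects: List[str]  # MeSH aspect terms


def parse_gold_line(line: str) -> GoldPassage:
    """Parse one line of the gold standard file (assumed tab-separated)."""
    fields = line.rstrip("\r\n").split("\t")
    topicid, pmid, offset, length = fields[:4]
    # The aspect field holds MeSH terms separated by '|' characters;
    # a missing or empty field yields an empty list.
    aspect_field = fields[4] if len(fields) > 4 else ""
    aspects = [term.strip() for term in aspect_field.split("|") if term.strip()]
    return GoldPassage(topicid, pmid, int(offset), int(length), aspects)


# Hypothetical values, for illustration only.
passage = parse_gold_line("160\t12345678\t100\t250\tPrions|Cattle\n")
print(passage.aspects)
```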

Evaluation Tools

Listed below are the programs we developed, along with the lead programmer for each and a description of what it does. All programs are written in Python. Most people will be interested in the program that takes the relevance judgments (qrels) and allows calculation of MAP for passages, aspects, and documents. The qrels are in the file final.goldstd.tsv.txt in this directory, while the program that takes a submission file and the qrels to generate results is trecgen2006_score.py, which is described below.

1. checkrun.py [path to submitted run] > STDOUT

(Hari Krishna Rekapalli, OHSU)
This program checks a submitted run file for the appropriate syntax and format, including the following:
The program prints either the message "OK" or "ERROR at line nn".

2. poolspans.py [directory of submitted run files] [path to legalspans file] [pool size] > STDOUT

(Hari Krishna Rekapalli, OHSU)
This program takes the submitted, checked runs and the legal spans file and produces a set of pooled spans of the given pool size for each topic. The output is written to STDOUT, with each line describing a span entered into the pool and the associated topic, in this tab-separated format:
<topicid> <pubmedid> <offset> <length>
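
One building block that pooling and evaluation both need is a way to relate a submitted passage to the legal spans of its document. The sketch below is not taken from poolspans.py, whose pooling logic is not described on this page; it only illustrates, under that caveat, how to find the single legal span that fully contains a passage. The function name enclosing_span is hypothetical, and the span values in the usage lines come from the legalspans.txt example for document 15485830.

```python
from bisect import bisect_right


def enclosing_span(spans, offset, length):
    """Return the legal span (start, length) that fully contains the passage
    [offset, offset + length), or None if no single span contains it.

    spans must be the (start, length) pairs of one document, sorted by start.
    """
    starts = [start for start, _ in spans]
    # Index of the last span that starts at or before the passage offset.
    i = bisect_right(starts, offset) - 1
    if i < 0 or length <= 0:
        return None
    start, span_length = spans[i]
    if offset + length <= start + span_length:
        return (start, span_length)
    return None


spans = [(0, 403), (406, 133), (542, 6)]
print(enclosing_span(spans, 410, 20))  # passage inside the second span
print(enclosing_span(spans, 400, 10))  # passage crosses a paragraph tag
```

A passage that crosses a paragraph tag, or that starts inside the tag itself, is reported as having no enclosing legal span.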

3. makeforms.py [pooled spans file] [topics file] [directory of plaintext span contents] > STDOUT

(Aaron Cohen, OHSU)
This program takes the pooled spans file and creates a judging form that includes the pooled passage identifiers, the topics and the plain text content of the spans.

4. makegoldstd.py [completed judging forms] [path to legalspans file] [path to zipped html files] > STDOUT

(Aaron Cohen, OHSU)
This program takes one or more completed judging forms (as a glob expression), the legal spans file, and the path to the original zipped HTML full-text files.
The output is a list of gold standard correct passages, along with topicid and aspect MeSH terms, in this format:
<topicid> <pubmedid> <offset> <length> <MeSH aspects, separated by '|' characters>

5. cleangoldstd.py > STDOUT

(Aaron Cohen, OHSU)
This program normalizes the MeSH terms assigned by relevance judges.

6. trecgen2006_score.py [gold standard file] [submission file] > STDOUT

(Aaron Cohen and Hari Krishna Rekapalli, OHSU)
This program takes the gold standard file generated above and a submitted entry file and computes and outputs the run scores as defined in the TREC 2006 Genomics track protocol page: passage, aspect, and document performance measures. Rank and score fields will be ignored.
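
For orientation, the document-level measure can be approximated with the standard definition of mean average precision (MAP). The sketch below is not the logic of trecgen2006_score.py, and the passage and aspect measures defined in the track protocol are not reproduced here. It assumes that a document is relevant to a topic if any gold standard passage comes from it, and that a document is ranked at its first appearance in the run. All function names are illustrative.

```python
def average_precision(ranked_pmids, relevant_pmids):
    """Standard average precision for one topic at the document level."""
    relevant = set(relevant_pmids)
    if not relevant:
        return 0.0
    seen = set()
    hits = 0
    rank = 0
    precision_sum = 0.0
    for pmid in ranked_pmids:
        if pmid in seen:
            continue  # a document counts only at its first appearance
        seen.add(pmid)
        rank += 1
        if pmid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(run, gold):
    """run and gold map topic id -> list (or set) of PMIDs.

    Topics are taken from gold; a topic missing from run scores 0.
    """
    if not gold:
        return 0.0
    total = sum(average_precision(run.get(topic, []), gold[topic]) for topic in gold)
    return total / len(gold)


# Hypothetical PMIDs, for illustration only.
gold = {"160": {"111", "333"}}
run = {"160": ["111", "222", "333"]}
print(mean_average_precision(run, gold))
```

In the usage lines, the relevant documents are found at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.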