Genetic Optimized Gene and Protein Name Synonym Extraction

Gold Standard Data Files

Here are the data files used in our research on gene and protein name synonym extraction. If you use any of these files in published research, I would appreciate your crediting our work by citing this article:

Cohen AM, Hersh WR, Dubay C, Spackman K. Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts. BMC Bioinformatics 2005;6(103). [pre-print pdf]

  1. medline-gene-2001-sentences.zip: Approximately 140,000 sentences extracted from MEDLINE abstracts from the year 2001 that contain the word "gene".
  2. medline-gene-2002-sentences.zip: Approximately 148,000 sentences extracted from MEDLINE abstracts from the year 2002 that contain the word "gene".
  3. medline-gene-2003-sentences.zip: Approximately 152,000 sentences extracted from MEDLINE abstracts from the year 2003 that contain the word "gene".
  4. precision-gold-standard.zip: Gene and protein name synonym pairs used as the gold standard for computation of precision.
  5. recall-gold-standard.zip: The 483 gene and protein name synonym pairs found both in SWISSPROT and in the sentences of medline-gene-2003-sentences, used as the gold standard for computation of recall.
  6. manually-verified-pairs.zip: Gene and protein name synonym pairs that were initially considered incorrect but were found to be synonyms when the MEDLINE database was reviewed during the error analysis.
  7. stoplist.txt: Words and patterns easily confused with gene names and removed from further synonym extraction processing.
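
As a rough illustration of how the two gold standards above might be used, the following Python sketch scores a set of extracted synonym pairs for precision and recall. It is not the evaluation code from the article. The internal format of the zip files is not described here, so the sketch works on in-memory lists of pairs; the function names, the case-insensitive and order-insensitive matching, and the toy gene names are all assumptions made for the example.

```python
def normalize_pair(a, b):
    """Represent a synonym pair as an unordered, lowercased set.

    Assumption: (A, B) and (B, A) count as the same pair, and case is ignored.
    """
    return frozenset((a.strip().lower(), b.strip().lower()))


def precision_recall(extracted, precision_gold, recall_gold):
    """Score extracted pairs against separate precision and recall gold standards.

    Each argument is an iterable of (name, name) tuples. How the pairs are
    read from the distributed files is left out, since the file layout is
    not specified on this page.
    """
    found = {normalize_pair(a, b) for a, b in extracted}
    p_gold = {normalize_pair(a, b) for a, b in precision_gold}
    r_gold = {normalize_pair(a, b) for a, b in recall_gold}

    # Precision: fraction of extracted pairs confirmed by the precision gold standard.
    precision = len(found & p_gold) / len(found) if found else 0.0
    # Recall: fraction of recall gold standard pairs that were extracted.
    recall = len(found & r_gold) / len(r_gold) if r_gold else 0.0
    return precision, recall


# Toy data for illustration only; these pairs are not taken from the files.
extracted_pairs = [("p53", "TP53"), ("tp53", "P53"), ("BRCA1", "RAD51")]
precision_pairs = [("TP53", "p53")]
recall_pairs = [("p53", "TP53"), ("HER2", "ERBB2")]

precision, recall = precision_recall(extracted_pairs, precision_pairs, recall_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Keeping the two gold standards as separate arguments mirrors the listing above, where precision and recall are each computed against their own file.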