TREC Genomics Track - Background

William Hersh, Track Chair
Oregon Health & Science University
hersh@ohsu.edu

This document provides background about the Text Retrieval Conference (TREC, http://trec.nist.gov ) and genomics information resources that are available.

Text Retrieval Conference

TREC is an annual activity of the information retrieval (IR) community aiming to evaluate systems and users. It is sponsored by the National Institute of Standards and Technology (NIST). IR has historically focused on document retrieval, but the field has expanded in recent years with the growth of new information needs (e.g., question-answering, cross-lingual), data types (e.g., video) and platforms (e.g., the Web). A key feature of TREC is that research groups work on a common source of data and a common set of queries or tasks. The goal is to allow comparisons across systems and approaches in a research-oriented, collegial manner.

TREC activity is organized into “tracks” of common interest, such as question-answering, multi-lingual IR, Web searching, and interactive retrieval. TREC generally works on an annual cycle, with data distributed in the spring, experiments run in the summer, and the results presented at the annual conference, which usually takes place in November. TREC also has a notion of exploratory efforts, called “pre-tracks.” New tracks tend to come about when a critical mass of interest emerges within the community. This was the case for genomics, when IR researchers found themselves increasingly drawn to this societally relevant domain where rich resources had already been developed.

Evaluation in TREC is based on the “Cranfield paradigm,” which measures system success by the quantities of relevant documents retrieved, in particular the metrics of recall (the proportion of relevant documents that are retrieved) and precision (the proportion of retrieved documents that are relevant). Operationally, recall and precision are calculated using a test collection of known documents, queries, and judgments of relevance between them. In most TREC tracks, the two are combined into a single measure of performance, mean average precision (MAP). Average precision is computed for a given query by averaging the precision values measured after each relevant document is retrieved; MAP is the mean of these values over all of the queries.
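
As an illustration of the metrics described above, the following is a minimal sketch in Python, not the official TREC evaluation software. The function names and the document identifiers are hypothetical, and the sketch assumes binary relevance judgments, a ranked list without duplicate documents, and that relevant documents never retrieved contribute zero to average precision.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_recall(ranked: Sequence[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query's retrieved list."""
    hits = sum(1 for doc in ranked if doc in relevant)
    precision = hits / len(ranked) if ranked else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average the precision measured at the rank of each relevant document.

    Assumption: relevant documents that are never retrieved add zero,
    so the sum is divided by the total number of relevant documents.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision values."""
    values = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Hypothetical ranked output and relevance judgments for two queries.
    query_1 = (["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d6"})
    query_2 = (["d7", "d8"], {"d8"})
    print(precision_recall(*query_1))
    print(average_precision(*query_1))
    print(mean_average_precision([query_1, query_2]))
```

In the first hypothetical query, relevant documents appear at ranks 1 and 3, so the precision values 1/1 and 2/3 are summed and divided by the three known relevant documents.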

Some TREC tracks necessitate different evaluation metrics. The Question-Answering track, for example, focuses on finding a single answer to a question as high in the ranked output as possible. As such, the evaluation metric used is mean reciprocal rank (MRR). The performance metric in the Interactive Track has varied depending on the specific user task, but is usually a measure reflective of what the user has been asked to do, such as find one or many answers to a given question.
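
Mean reciprocal rank can be sketched in the same way. The code below is an illustrative assumption rather than the Question-Answering track's actual scoring procedure: each question scores the reciprocal of the rank of its first correct answer (zero if no correct answer is returned), and MRR is the mean of these scores. The function names and answer strings are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked_answers: Sequence[str], correct: Set[str]) -> float:
    """Return 1/rank of the first correct answer, or 0.0 if none is found."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the reciprocal ranks over all questions."""
    values = [reciprocal_rank(ranked, correct) for ranked, correct in runs]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Hypothetical ranked answers for three questions.
    runs = [
        (["a1", "a2", "a3"], {"a1"}),  # correct answer at rank 1
        (["b1", "b2", "b3"], {"b3"}),  # correct answer at rank 3
        (["c1", "c2", "c3"], {"c9"}),  # no correct answer returned
    ]
    print(mean_reciprocal_rank(runs))
```

Because only the first correct answer counts, the measure rewards systems that place a single answer as high in the ranked output as possible.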

Genomics Information Resources

Many genomics information resources are available. Because the TREC Genomics track will need to stay focused in scope, we will be guided by several principles:

The task scenario will be that of a user seeking to acquire new knowledge in a sub-area of biology linked with genomics information (as opposed to a domain expert seeking information in his/her area of expertise)
The databases will be publicly available (though we will use proprietary information where it aids the experiments and can be obtained for research use, e.g., the full text of journal articles)
The focus of the task will be on text retrieval (though we will incorporate non-textual data, e.g., genomic sequences, when the task includes it, but not require systems or users to manipulate or interpret it)

Some readers may be unfamiliar with the basic biomedical information resources. A great deal of public data is available, most notably the resources from the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov ), a division of the National Library of Medicine (NLM, http://www.nlm.nih.gov ) that maintains most of the library’s genomics-related databases. MEDLINE is a bibliographic database of biomedical literature. Each record contains the title, authors, MeSH indexing terms, etc., but does not contain the full text. Most but not all records contain the article abstract. PubMed is the NLM Web system that provides access to MEDLINE as well as a dozen other databases. MEDLINE is the data and PubMed is the search system. Other vendors, such as Ovid Systems, license MEDLINE and load it into their own searching systems.

Key features of NCBI data include linkage and annotation. Linkage among resources allows the user to explore different types of knowledge across resources. For example, the original research documenting the discovery of a gene function appears in MEDLINE (PubMed, http://pubmed.gov ), with links to the nucleotide sequence in GenBank, the structure of the protein in the Molecular Modeling Database (MMDB), and an overview of the diseases it may cause in humans in the Online Mendelian Inheritance in Man (OMIM) textbook. LocusLink serves as a switchboard that integrates these resources and provides annotation of the gene’s function using the widely accepted Gene Ontology (GO, http://www.geneontology.org/ ). PubMed also provides linkages to full-text journal articles on the Web sites of publishers.

Additional genomics resources exist beyond NCBI. Of particular note are the model organism genome databases, such as:

Mouse Genome Informatics - http://www.informatics.jax.org/
Saccharomyces Genome Database - http://genome-www.stanford.edu/Saccharomyces/
FlyBase Database of the Drosophila Genome - http://flybase.bio.indiana.edu/

As with NCBI resources, these resources provide rich linkage and annotation. They also carry considerable potential for circularity in IR, because the annotations are fed to NCBI and incorporated into LocusLink as well as other resources, such as the SWISS-PROT protein sequence database ( http://us.expasy.org/sprot/ ).

Last updated - February 25, 2005