TREC Genomics Track - Background
William Hersh, Track Chair
Oregon Health & Science University
hersh@ohsu.edu
This document provides background about the Text Retrieval Conference
(TREC,
http://trec.nist.gov ) and genomics
information resources that are available.
Text Retrieval Conference
TREC is an annual activity of the information retrieval (IR) community
aiming
to evaluate systems and users. It is sponsored by the National
Institute
of Standards and Technology (NIST). IR has historically focused on
document
retrieval, but the field has expanded in recent years with the growth
of
new information needs (e.g., question-answering, cross-lingual), data
types
(e.g., video) and platforms (e.g., the Web). A key feature of TREC is
that
research groups work on a common source of data and a common set of
queries
or tasks. The goal is to allow comparisons across systems and
approaches
in a research-oriented, collegial manner.
TREC activity is organized into “tracks” of common interest, such as
question-answering,
multi-lingual IR, Web searching, and interactive retrieval. TREC
generally
works on an annual cycle, with data distributed in the spring,
experiments
run in the summer, and the results presented at the annual conference,
which usually takes place in November. TREC also has a notion of
exploratory
efforts, called “pre-tracks.” New tracks tend to come about when
a
critical mass of interest emerges within the community. This was the
case
for genomics, when IR researchers found themselves increasingly drawn
to
this societally relevant domain where rich resources have already
been
developed.
Evaluation in TREC is based on the “Cranfield paradigm” that measures
system
success based on quantities of relevant documents retrieved, in
particular
the metrics of recall and precision. Operationally, recall and
precision
are calculated using a test collection of known documents, queries, and
judgments
of relevance between them. In most TREC tracks, the two are combined
into
a single measure of performance, mean average precision (MAP). Average
precision for a given query is the mean of the precision values measured
after each relevant document is retrieved; MAP is this value averaged
over all of the queries.
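The calculation can be made concrete with a minimal sketch. It assumes that each query's output is a ranked list of document identifiers without duplicates and that the relevance judgments are a set of identifiers per query. The function names and data layout are illustrative assumptions, not the official TREC evaluation software.

```python
from typing import Dict, List, Set, Tuple


def precision_recall(ranked: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one retrieved list.

    precision = relevant retrieved / total retrieved
    recall    = relevant retrieved / total relevant
    """
    retrieved_relevant = sum(1 for doc in ranked if doc in relevant)
    precision = retrieved_relevant / len(ranked) if ranked else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values taken at the rank of each relevant document.

    Dividing by the total number of relevant documents means that relevant
    documents never retrieved contribute a precision of zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average the per-query average precision over every judged query."""
    if not qrels:
        return 0.0
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)
```

In this sketch a query with no retrieved documents scores zero rather than being skipped, so a system is penalized for failing to answer a judged query.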
Some TREC tracks necessitate different evaluation metrics. The
Question-Answering
track, for example, focuses on finding a single answer to a question as
high
in the ranked output as possible. As such, the evaluation metric
used
is mean reciprocal rank (MRR). The performance metric in the
Interactive
Track has varied depending on the specific user task, but is usually a
measure
reflective of what the user has been asked to do, such as find one or
many
answers to a given question.
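As an illustration of mean reciprocal rank, the following sketch scores each question by the reciprocal of the rank at which the first correct answer appears and then averages over questions. It assumes that a question with no correct answer in the ranked output scores zero. The names and data layout are hypothetical, chosen only for this example.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked: List[str], correct: Set[str]) -> float:
    """Return 1/rank of the first correct answer, or 0.0 if none is returned."""
    for rank, answer in enumerate(ranked, start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Dict[str, List[str]], answers: Dict[str, Set[str]]
) -> float:
    """Average the reciprocal rank over every question that has an answer key."""
    if not answers:
        return 0.0
    scores = [reciprocal_rank(runs.get(q, []), key) for q, key in answers.items()]
    return sum(scores) / len(scores)
```

Because only the first correct answer counts, this measure rewards placing a single answer near the top of the output rather than retrieving many relevant items.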
Genomics Information Resources
Many genomics information resources are available. As
the
TREC Genomics track will need to stay focused in scope, we will be
guided
by several principles:
- The task scenario will be that of a user seeking to acquire new
knowledge
in a sub-area of biology linked with genomics information (as opposed
to
a domain expert seeking information in his/her area of expertise)
- The databases will be publicly available (though we will use
proprietary
information where it aids the experiments and can be obtained for
research
use, e.g., the full text of journal articles)
- The focus of the task will be on text retrieval (though we will
incorporate
non-textual data, e.g., genomic sequences, when the task includes it,
but
not require systems or users to manipulate or interpret it)
Some readers may be unfamiliar with the basic biomedical information
resources.
A great deal of public data is available, most notably the
resources
from the National Center for Biotechnology Information (NCBI,
http://www.ncbi.nlm.nih.gov ), a division of the National Library
of Medicine (NLM,
http://www.nlm.nih.gov ) that maintains most of the library’s
genomics-related databases.
MEDLINE is a bibliographic database of biomedical literature.
Each
record contains the title, authors, MeSH indexing terms, etc., but does
not
contain the full text. Most but not all records contain the
article
abstract. PubMed is the NLM Web system that provides access to
MEDLINE
as well as a dozen other databases. MEDLINE is the data and PubMed is
the
search system. Other vendors, such as Ovid Systems, license MEDLINE and
load
it into their own searching systems.
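For readers who think in terms of data structures, a MEDLINE record as described above might be modeled roughly as follows. This is a sketch only: the class and field names are assumptions made for illustration (including the identifier field, which the description above does not mention), and it does not reproduce the actual MEDLINE record format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MedlineRecord:
    """Hypothetical, simplified view of one bibliographic record."""

    record_id: str  # assumed unique identifier; not named in the text above
    title: str
    authors: List[str] = field(default_factory=list)
    mesh_terms: List[str] = field(default_factory=list)
    abstract: Optional[str] = None  # most, but not all, records have one
    # There is deliberately no full-text field: the record is bibliographic only.

    def searchable_text(self) -> str:
        """Join the text fields a retrieval system could index."""
        parts = [self.title]
        if self.abstract:
            parts.append(self.abstract)
        parts.extend(self.mesh_terms)
        return " ".join(parts)
```

Making the abstract optional reflects the point that a search system built on MEDLINE has to cope with records that carry only a title and indexing terms.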
Key features of NCBI data include linkage and annotation. Linkage
among
resources allows the user to explore different types of knowledge
across
resources. For example, the original research documenting the
discovery
of a gene function appears in MEDLINE (PubMed, http://pubmed.gov ),
with links to the nucleotide sequence in GenBank, the structure of
the
protein in the Molecular Modeling Database (MMDB), and an overview of
the
diseases it may cause in humans in the Online Mendelian Inheritance in
Man
(OMIM) textbook. LocusLink serves as a switchboard that integrates these
resources and provides annotation of the gene’s function using the
widely accepted Gene Ontology (GO, http://www.geneontology.org/ ).
PubMed also provides linkages to full-text journal articles on the
Web
sites of publishers.
Additional genomics resources exist beyond NCBI. Of particular
note
are the model organism genome databases.
As with NCBI resources, these resources provide rich linkage and
annotation.
They provide great potential for circularity in IR as the annotations
are
fed to NCBI and incorporated into LocusLink as well as other resources,
such
as the SWISS-PROT protein sequence database (
http://us.expasy.org/sprot/ ).
Last updated - February 25, 2005