TREC Genomics Track - August, 2005 Survey

Results of TREC Genomics Track Survey

There are some obvious conclusions:

Respondents want to keep the current user focus on biological researchers.

Respondents are most interested in using documents that are full-text journal articles.

For the second task, respondents are most interested in information extraction, i.e., text mining.

About half of all groups are interested in interactive experiments.

1. The track will continue to have an ad hoc retrieval task based on real information needs. For the 2006 track, rank the following populations of users that you would like to see the track simulate.

Rank

Response Total

Biological researchers (current focus): 77% (20), 4% (1), 19% (5)
Health care professionals (e.g., physicians, nurses): 15% (4), 69% (18), 15% (4)
Laypeople (e.g., general public, patients, consumers): 12% (3), 75% (18)

Total Respondents

(skipped this question)

2. If we were to focus on biological researchers or health care professionals, rank the following document sets you would like to see used for ad hoc retrieval experiments.

Rank

Response Total

Continued use of MEDLINE: 36% (9), 44% (11), 4% (1), 16% (4)
Large collection of full-text journal articles (e.g., 100-200 journals spanning 3-5+ years): 60% (15), 28% (7), 8% (2), 4% (1)
General Web crawl (e.g., the half-terabyte TREC Terabyte .GOV crawl that has biomedical content from NIH, NLM, CDC, etc.): 16% (4), 0% (0), 32% (8), 52% (13)
Focused biomedical crawl (over .GOV or other sites): 8% (2), 12% (3), 44% (11), 36% (9)

Total Respondents

(skipped this question)
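The conclusion that respondents prefer full-text journal articles can be checked with a little arithmetic on the counts for question 2. The following is only a minimal sketch, not part of the original survey analysis. It assumes that the four values reported for each document set correspond to ranks 1 through 4, in that order; the survey output does not label the columns, so this ordering is an assumption. The names rank_counts, mean_rank, and mean_ranks are hypothetical and chosen for illustration.

```python
# Counts per document set, copied from the question 2 results.
# ASSUMPTION: the four values in each row are the counts for
# ranks 1, 2, 3, and 4, in that order (the columns are unlabeled).
rank_counts = {
    "Continued use of MEDLINE": [9, 11, 1, 4],
    "Large collection of full-text journal articles": [15, 7, 2, 1],
    "General Web crawl": [4, 0, 8, 13],
    "Focused biomedical crawl": [2, 3, 11, 9],
}


def mean_rank(counts):
    """Average rank position, weighting rank i (1-based) by its count."""
    total = sum(counts)
    weighted = sum(rank * n for rank, n in enumerate(counts, start=1))
    return weighted / total


mean_ranks = {option: mean_rank(counts) for option, counts in rank_counts.items()}

# A lower mean rank means the option was preferred more strongly.
for option, value in sorted(mean_ranks.items(), key=lambda kv: kv[1]):
    print(f"{value:.2f}  {option}")
```

Under that column-order assumption, the full-text journal collection has the lowest mean rank (1.56), followed by MEDLINE (2.00), which is consistent with the conclusion stated at the top of this page.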

3. If we were to focus on laypeople, suggest possible resources we could use, remembering that we need a substantial enough amount of content to make the retrieval experiments interesting.

Total Respondents

Web pages.
WebMD.
I think it would be best not to focus on laypeople but to expand our data to non-fulltext, structured data.
http://medlineplus.gov/ http://www.healthfinder.gov/ http://www.cdc.gov/ webMD http://www.health.state.ny.us/healthaz/.
Papers from Science, Nature and Cell, etc.
Science popularization journal or magazine.
Scripts and/or screenplays from the TV show CSI. Web BLOGS on genetics. Chat room transcripts from genetic disease specific chat rooms.
Some Web sites for laypeople - MedlinePlus.
No suggestions.
Medline Plus, ClinicalTrials.gov, health.nih.gov, www.healthfinder.gov, www.ahcpr.gov/consumer/.

(skipped this question)

4. Please add any other comments you may have about the ad hoc retrieval experiments for the track.

Total Respondents

The corpus should have links or references between documents to highlight the more frequently referenced or cited documents; this is due to the hypertextual nature of the Web.
We didn't participate this year because of the change in format and focus. I'd prefer the use of the full character set from whatever collection is used rather than the reduced set used so far.
There is little training data every year, as the ad hoc task changes quite a lot, including the topic form and content. It is unlike other tracks (such as QA and Terabyte) on this point. So it is really hard to determine how to configure the parameters in our system reasonably.
Make the queries realistic, using the wording that a researcher/doctor/layperson would use, depending on the user population.
We may need more samples and relevance judgments.
The topics should be more formatted.
For next year, we'd like to see a collection of full-text journal articles. Based on our discussions with biologists, we think it would be interesting to use, for example, citations in the retrieval process. Our biologists usually start their retrieval process with a keyword-based search, but after this initial search they continue by looking at the top 20 retrieved docs. Based on their topic, their authors and their citations they continue their search. We would be really interested in using these metadata in the retrieval process.
How about looking at document relationships, e.g., (1) sets of documents that conflict with one another, (2) sets of documents that use the same experimental technique, or (3) documents relevant to enumerated aspects (e.g., experimental techniques, previous results, findings relevant to components) of some explicitly stated biological hypothesis?

(skipped this question)

5. We will also continue to have a second task in the track. Rank the following list of tasks in order of their desirability.

Rank

Response Total

Question-answering (similar to the current TREC QA track)

27% (6)

9% (2)

41% (9)

23% (5)

Summarization (automatic summarizing of biomedical documents)

5% (1)

32% (7)

36% (8)

27% (6)

Categorization (similar to the current task in the track, perhaps with data different from MGI)

27% (6)

23% (5)

18% (4)

32% (7)

Extraction (extraction of concepts, facts, etc. from text)

45% (10)

32% (7)

18% (4)

5% (1)

Total Respondents

(skipped this question)

6. Please add any other comments you may have about the second task for the track.

Total Respondents

It would be great to be doing something that overlaps with other NLP sub-communities--either QA or summarization would be quite cool.
We're primarily interested in linkage -- mapping text mentions to database records, such as mentions in MEDLINE to EntrezGene accession numbers. A secondary interest is classification by topic.
The judgment of the second task is open. It requires that none of the participants cheat, since they could tune their systems just to fit the test data.
Information extraction and/or summarization would draw a larger pool of participants.
More training data.
A task like QA would be more interesting.
Information extraction to sophisticated representations. Most IE work has been looking at very simple 'atomic' extractions. I would like to see more complex, frame-y representations get filled, perhaps in the tradition of MUC. For example, consider the 'crossing the clause boundary' paper at ISMB last June -- finding phosphorylation catalysts and substrates -- as a first step.

(skipped this question)

7. Would you have the interest and resources to participate in an interactive task (i.e., experiments with real users) in the Genomics Track?

Response Percent

Response Total

Yes: 47.8%
No: 52.2%

Total Respondents

(skipped this question)

8. If Yes, what types of experiments or research questions would you be interested in?

Total Respondents

High-throughput.
Whether users would be interested in software that facilitates massive collaborative curation of knowledge from texts and documents, and what features they would like in such software.
Real biological researchers as subjects. Query refinement through a variety of means would be our main interest. Basically, like the HARD task. I would like to see the focus be on difficult tasks like teasing apart various usages of ambiguous terms, or finding just human instances of genes, etc.
Scientific discovery - an interactive tool that helps scientists apply the scientific method.
Expert annotation standards and evaluation.
Ontology-based interactive retrieval: how can we use ontologies to help users to retrieve, rank and display relevant documents/information.
The natural one is query refinement or at least use of relevance feedback. A more ambitious thing would be an interactive annotator markup -- perhaps annotators start marking up a document and are provided with useful additional material (within the same document or across documents) on the fly.

(skipped this question)