This page contains the instructions given to the human annotators for the task; they may also give participants a more complete picture of how we will evaluate their results.

Guidelines for Evaluating the Answers and References Generated by LLMs

Introduction

Large language models (LLMs) have achieved state-of-the-art performance across a wide range of natural language processing and information retrieval tasks. The TREC 2024 BioGen task focuses on (a) reference attribution and (b) the quality and factuality of the text that LLMs generate to answer clinical questions, which clinicians ask either to satisfy their own information needs or to answer health-related questions from their patients. For patients, we envision that the answers will be reviewed by clinicians and then explained in plain language.

This evaluation aims to verify both the attributed references and the quality of the LLM-generated answers. For the former, we will evaluate how well the answers to clinical questions are supported by the evidence the models provide in the form of references (in this evaluation, PubMed abstracts serve as the cited documents).

An example of a question and generated answer with supporting documents:

Note: this is not a literal example of the required system output format, but rather a schematic example of what an annotator might see.

Question: In adult patients with total hip replacements, how effective is pain medication in recovery?

Generated Answer: Pain medication has been found to be effective in controlling pain during recovery from total hip replacements [3][4]. The pain scores in the postoperative course of patients undergoing total hip arthroplasty have been shown to decrease consistently with increasing mobilization and rehabilitation in the first week [3] …

Cited Documents:

[1] PMID: 24996539

Background: Surgical pain is managed with multi-modal anaesthesia in total hip replacement (THR) and total knee replacement (TKR). It is unclear whether including local anaesthetic infiltration before wound closure provides additional pain control …

[2] PMID: 28870302

Total hip and knee arthroplasty is associated with significant perioperative pain, which can adversely affect recovery by increasing risk of complications, length of stay, and cost. Historically, opioids were the mainstay of perioperative pain control …

Part 1: Evaluating Answer Alignment with Questions and Evidence Support

In the first step of the annotation, you will evaluate whether the generated text, taken as a whole, directly answers the question.

If it does not, select “No” and move to the next answer.

Interpretability of the Answer Sentences:

In the next step of the annotation, you will evaluate the interpretability of each sentence of the generated answer in turn; we will refer to these as the answer sentences.

For each answer sentence, determine whether all of the information it conveys is interpretable to you; that is, whether you can understand the answer sentence on its own. If any part of the answer sentence is unclear or hard to interpret, select “No” and move on to the next sentence in the generated answer.

An answer sentence may be uninterpretable due to:

  • Vague or ambiguous meaning, e.g., unclear noun references or pronoun use.
  • Malformed phrases and sentences that are difficult to understand.

If the answer sentence is interpretable, you will proceed to the next annotation step.

Note: While evaluating interpretability, consider yourself a healthcare professional, not a consumer. Some medical terminology may be uninterpretable to a consumer but interpretable to a healthcare professional.

Identifying Evidence Support for the Generated Answer Sentences:

In this step, for each generated answer sentence, you will select the relevant sentence(s) from each document cited in that answer sentence and determine the document’s relation, if any, to the generated assertion. Four relations between the answer sentence and the document are possible: Supports, Contradicts, Neutral, and Not Relevant.

Assertions are statements of single facts, e.g., “knee arthroplasty is associated with significant perioperative pain.” An answer sentence may contain more than one assertion. The task is to identify the assertions in the answers. Generally, each statement in a generated answer is an assertion, barring some exceptions, including statements about the speaker itself (e.g., “As a language model, I do not have the ability to answer the specific treatment.”) and questions posed to the user (e.g., “Would you like to learn more?”).

The LLM-generated answers come with the corresponding cited documents, which, according to the model, support the assertions.

Note: Statements in the answer sentence that define medical terms or procedures should not be selected as assertions in this evaluation. Our goal is to evaluate the claims generated by the LLMs about diseases, symptoms, drugs, treatments, medical procedures, etc., not definitions of terms or procedures.

Note: You can select all the sentences in the document that are relevant to the assertion. The interface allows you to choose non-contiguous sentences from the document.
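The exceptions above can be approximated with a rough heuristic filter. This sketch is purely illustrative (the function name and string checks are our own, not part of the task); annotators apply judgment, not code, and excluding definitional statements in particular still requires human reading:

```python
def is_candidate_assertion(sentence: str) -> bool:
    """Rough heuristic mirroring the stated exceptions: skip questions posed
    to the user and statements the model makes about itself. Definitions of
    terms or procedures must still be excluded by human judgment."""
    s = sentence.strip()
    if s.endswith("?"):  # questions posed to the user
        return False
    if s.lower().startswith("as a language model"):  # statements about the speaker
        return False
    return True

# Examples from the guideline text
is_candidate_assertion("Would you like to learn more?")  # → False
is_candidate_assertion(
    "Knee arthroplasty is associated with significant perioperative pain."
)  # → True
```

Everything that passes this filter is then inspected sentence by sentence in the annotation interface.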

Examples

Answer Sentence: When a kidney stone has been present for almost three years, there are several treatment options available.

Assertion:

  • [for a kidney stone] there are several treatment options available

Answer Sentence: One approach is medical expulsive therapy, which involves the use of drugs to increase the rate of stone expulsion along the ureter.

Assertion:

  • use of drugs to increase the rate of stone expulsion along the ureter

Answer Sentence: However, doses can be increased to achieve a high-quality response while minimizing side effects, with the goal of reaching a dose range that balances efficacy, tolerability, and acceptability.

Assertions:

  • doses can be increased to achieve a high-quality response while minimizing side effects
  • with the goal of reaching a dose range that balances efficacy, tolerability, and acceptability

Labels

Given a question, a model’s answer, and a list of documents referenced by the model, for each sentence in the model’s answer:

  1. Highlight an atomic assertion / statement, e.g.:

    “pain scores decrease consistently with increasing mobilization and rehabilitation” or “Pain medication has been found to be effective in controlling pain during recovery from total hip replacements”

  2. For each atomic statement and for each document provided as a reference, select one of the following labels:

    Supports: There is at least one sentence in the referenced document that supports/agrees with the statement, e.g., “opioids were the mainstay of perioperative pain control.”

    Contradicts: There is at least one sentence in the referenced document that disagrees with the assertion or states its opposite, e.g., “Increasing pain levels after the first week postoperatively, for 3 days, are most likely to be caused by the change to more extensive mobilization and physiotherapy in the rehabilitation unit.”

    Neutral: The referenced document is topically relevant, but lacks any information to validate or invalidate the assertion.

    Not relevant: The referenced document is not relevant to the sentence.
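One completed judgment from this step can be pictured as a small record pairing an atomic assertion with a cited document and one of the four labels. The field names and example values below are illustrative only, not part of the official task format:

```python
from dataclasses import dataclass, field
from enum import Enum

class Relation(Enum):
    """The four possible relations between an assertion and a cited document."""
    SUPPORTS = "Supports"
    CONTRADICTS = "Contradicts"
    NEUTRAL = "Neutral"
    NOT_RELEVANT = "Not relevant"

@dataclass
class EvidenceJudgment:
    """One annotator judgment linking an atomic assertion to a referenced document."""
    assertion: str                 # atomic statement highlighted in the answer sentence
    pmid: str                      # PubMed ID of the referenced document
    relation: Relation             # one of the four labels above
    evidence_sentences: list = field(default_factory=list)  # sentences selected from the document

# Illustrative judgment only; the relation shown is not an official annotation.
judgment = EvidenceJudgment(
    assertion="pain scores decrease consistently with increasing mobilization and rehabilitation",
    pmid="24996539",
    relation=Relation.NEUTRAL,
)
```

Note that one assertion yields one such judgment per referenced document, so an answer sentence with two assertions and two cited documents produces four judgments.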

Part 2: Answer Quality and Completeness

The goal of this step is to evaluate the relevance of the assertions in the answer sentences to the question.

For each assertion in the generated answer, provide one of the following labels:

  1. Required: The assertion ‘XXX’ is necessary to include in the generated answer for completeness of the answer.

  2. Unnecessary: The assertion ‘XXX’ does not need to be included in the generated answer. An assertion may be unnecessary for several reasons:

    1. If including it would cause information overload;
    2. If it is trivial, e.g., stating that many treatment options exist;
    3. If it consists entirely of a recommendation to see a health professional;
    4. If it is not relevant to the answer, e.g., describing the causes of a disease when the question is about treatments.
  3. Borderline: If an assertion is relevant, possibly even “good to know,” but not required, it may be marked borderline.

    For example, an assertion describing an experimental treatment when the question is about the best-established treatments. Borderline assertions are OK to include, but not ideal.

  4. Inappropriate: The assertion may harm the patient. For example, if the answer states that physical therapy reduces pain, but the patient experiences more pain due to hip mobilization, the patient may start to doubt that they are receiving adequate treatment.

    Other inappropriate content may include unverified claims, such as ivermectin treatment for COVID-19.
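The four Part 2 labels can likewise be sketched as a small enumeration. The helper function and its acceptance rule below are our own reading of these guidelines (Required and Borderline assertions are acceptable to keep; Unnecessary and Inappropriate are not), not part of the task specification:

```python
from enum import Enum

class QualityLabel(Enum):
    """Part 2 labels assigned to each assertion in the generated answer."""
    REQUIRED = "Required"            # necessary for a complete answer
    UNNECESSARY = "Unnecessary"      # overloading, trivial, a bare referral, or off-topic
    BORDERLINE = "Borderline"        # relevant, even "good to know", but not required
    INAPPROPRIATE = "Inappropriate"  # potentially harmful or an unverified claim

def is_acceptable(label: QualityLabel) -> bool:
    """Whether an assertion with this label is acceptable to keep in the answer
    (an illustrative reading of the guidelines, not an official rule)."""
    return label in (QualityLabel.REQUIRED, QualityLabel.BORDERLINE)
```

For instance, an assertion describing an experimental treatment would be labeled Borderline and kept, while an ivermectin-for-COVID-19 claim would be labeled Inappropriate and flagged.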