Stretch Goal: Structured Output

Often, we have multiple pieces of data we would like to abstract at once, or our data has a more complicated shape than a simple string or number. One way to deal with this is to invoke the LLM once per field, and then stitch the results together into a dataframe; this gives us maximal control over how each field is processed, and is sometimes the approach we need to take. However, it can also be tedious, and results in a lot of duplicated code.

An alternative is to use ellmer’s support for extracting structured data using LLMs.

The Basic Idea

The core idea is that rather than writing a separate query for each piece of information, we instead tell ellmer about our desired output structure: what fields we want, their datatypes (e.g. whether they are strings, numeric, factors, etc.), and so forth. Along with the structure definition, we can provide specific instructions to the LLM for each field.

For example, if we wanted to extract both geography and study design from our abstracts, we could describe a structure that included both together, like so:

abstracted_record_type <- type_object(
  "Data extracted from a published abstract of a clinical trial",
  geography = type_string("The geographic location in which the study took place."),
  study_design = type_string("The study design described by this paper")
)

Then, instead of calling the $chat() method, we call $chat_structured(), providing it with our prompt as well as our newly-defined type:

abstract_text <- lit_review_df %>% head(1) %>% select(abstract) %>% pull()

c <- chat_openai(
  model = "gpt-4.1-nano",
  system_prompt = "Extract the specified data elements from an article abstract.",
  params = params(seed = 42)
)

c$chat_structured(abstract_text, type = abstracted_record_type)
$geography
[1] "Hoima District Uganda"

$study_design
[1] "individual randomised superiority trial"

The result comes back as a standard named list.
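To make that concrete (a minimal sketch reusing the objects defined above; `rec` is just an illustrative name):

rec <- c$chat_structured(abstract_text, type = abstracted_record_type)
rec$geography            # pull out a single field, as with any named list
tibble::as_tibble(rec)   # or turn the whole record into a one-row tibble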

Bulk-processing

If you are processing multiple records, you have a couple of options. Previously, we used the map_* family of functions to call our LLM function over each row of a dataframe; we can do the same with $chat_structured(), stitching the resulting named lists back together into dataframe columns "by hand" (sketched below). However, there is a better way: this use case is common enough that ellmer provides a dedicated function, parallel_chat_structured().
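For reference, the "by hand" version might look something like this (a sketch, assuming the `c` chat object and `abstracted_record_type` defined above are still around; purrr and dplyr do the stitching):

# Run the structured query once per abstract; cloning the chat gives each
# abstract a fresh conversation, so earlier answers can't leak into later ones
records <- lit_review_df$abstract %>%
  purrr::map(\(x) c$clone()$chat_structured(x, type = abstracted_record_type))

# Stitch the resulting named lists back together into dataframe columns
results_df <- dplyr::bind_rows(records)

This works, but it makes one request at a time; the dedicated function handles the same job concurrently.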

You may recall meeting its unstructured cousin, parallel_chat(), in part 1; it involved a little bit of awkward post-processing to actually use its outputs. Fortunately, parallel_chat_structured() requires no such additional work, and gives us its output directly as a dataframe:

abstract_sample <- lit_review_df %>% sample_n(5) %>% select(abstract) %>% pull() %>% as.list()
parallel_chat_structured(c, abstract_sample, type = abstracted_record_type)
geography | study_design
Multiple locations with varying endemic levels of schistosomiasis (specific locations not provided) | Large-scale pragmatic randomized controlled trials
Ugandan | Interventional study on the impact of standard Schistosoma mansoni therapy in HIV-uninfected women
Gabon | clinical trial
Senegal | phase 3 trial
not specified in the abstract | simulation study with cost analysis and probabilistic modeling

Iterating and Improving

Looking at these results, we see a few things that might warrant improvement in our prompt design. We will begin with geography, which is not being reported in a consistent manner or with a consistent degree of detail. Can you think of ways to improve things?

Here are a few things to think about (one possible refinement is sketched after the list):

  1. Can you change the wording to be more precise about what output format we are asking for?
  2. Does it make sense to try and fit geography into a single column?
  3. Do all of the studies have a single geographic location? What should happen in those cases?
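As one possible direction (a sketch only, not the "right" answer: the field names, the array-of-countries idea, and the exact wording here are all our own choices), we could ask for countries as a list of standardized names, and mark fields that might be absent as optional:

abstracted_record_type_v2 <- type_object(
  "Data extracted from a published abstract of a clinical trial",
  # An array lets a multi-site study report several countries, and an
  # empty array cleanly represents "no country named"
  countries = type_array(
    items = type_string(),
    description = "Each country in which the study took place, as a standard English short name (e.g. 'Uganda'). Empty if the abstract does not name a country."
  ),
  # required = FALSE lets the model omit the field rather than guess
  study_design = type_string(
    "The study design, as a short standardized phrase (e.g. 'randomized controlled trial').",
    required = FALSE
  )
)

Whether to make a field optional, or to keep it required and ask for an explicit "not specified" value, is one of the trade-offs discussed in the documentation below.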

The main documentation page for this feature of ellmer has many more examples and things to try; one particularly important section to look at is the one on “Required vs. optional” data.

Final considerations

In general, the more complicated you make your output structure, the harder a time the LLM is going to have, and the bigger the difference in performance you are likely to see between the "nano-scale" LLM we have been using in this lab and larger, more powerful models. However, remember that larger models also come with larger costs, which can start to become an issue once you are running substantial amounts of data through the model, especially with more complex structured output.

You may also find that for some fields, you need more fine-grained control over the prompt or the input format. It is very common for an LLM-centric data abstraction workflow to make use of multiple methods and multiple models for the different parts of the desired output! My advice: start small and simple, and then add complexity only when needed.