First Steps with ellmer

Introduction

This exercise is designed to give you a bit of practice at using an LLM to assist in a literature review scenario. You will be working with the results of a preliminary PubMed query for clinical trials about Schistosomiasis, a disease caused by infection with one of several parasitic helminths from the genus Schistosoma. It is endemic to Africa, parts of South America, and parts of Southeast Asia, and is a significant global cause of morbidity and mortality.

In the first part of the activity, you will be using an LLM to assist in the screening and data abstraction process; in the second part, we will explore some additional techniques that can increase efficiency and methodological repeatability.

Important

This is a use case with a great deal of potential utility! However, it is important to remember that much of the value of a literature review comes from actually reading through the results, and as such, you should be mindful of how you use tools such as LLMs for this sort of task: they can save a lot of time, but can also lead to errors, mistakes, and lapses.

My personal advice for designing LLM-assisted workflows is to first spend time performing the task “by hand,” rather than diving head-first into an attempt at automation. This will help you: a) become familiar with the data; b) better understand what your actual needs are in terms of information and organization; and c) craft more useful instructions to the LLM, since you will know more about the range of possible inputs it is likely to encounter.

Meet your tools

There are a great many different ways of interacting with an LLM from R; Luis Arregoitia’s excellent guide covers many of them, and is a good place to start. Today, we will be using the ellmer library, as it strikes a good balance between a simple, easy-to-use API and a rich set of useful features.

For today’s workshop, I have set up access for you all to OpenAI’s API, but ellmer supports virtually all LLM back-ends, including some that run locally on your own computer. We will be working today with OpenAI’s GPT 4.1 family of models. For our initial exercises we will use the smallest (and cheapest) “nano” variant, though you should feel free to experiment with other models should a larger model prove necessary.
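As an aside, we will not use them today, but creating a chat object for a locally-running back-end looks nearly identical to what we will do below for OpenAI. The snippet here is only a sketch: it assumes you have Ollama installed and have already pulled the named model, which is purely an example.

# Not used in today's workshop: a chat object backed by a local Ollama model.
# Assumes Ollama is running locally and that "llama3.2" has been pulled.
local_chat <- chat_ollama(model = "llama3.2")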

To access OpenAI’s API, you must have an “API key”, which is essentially a password. It is common to provision a separate API key for each project, or for each group of users; this makes it possible to track usage and costs with more granularity, and to easily limit access to a resource (for example, by disabling an API key after a workshop has ended). For today’s activity, I have set up an API key for us all to use, and have set a small cost limit, which we are very unlikely to exceed.

Watch where you send your data!

Recall the data governance challenges of accessing an LLM via an API: unless you have a specific contractual agreement in place with the LLM provider (such as a Business Associate Agreement for HIPAA-covered information), the API must absolutely not be used for any non-public data. Depending on what you are doing, it may be difficult to tell what counts as “non-public”!

If you are working with student grades, or data containing information about or shared by students, that is typically considered to be restricted, as of course is protected health information. You may also want to think about questions of your own IP and copyright interests, and be wary of uploading unpublished experimental data.

For today’s workshop, we will be working with data that are entirely public (i.e., PubMed records), so there is nothing to worry about.

Tip

You can set up your own OpenAI developer account; it’s free to create, and for getting practice working with LLMs, $5 or $10 will go a very long way.

Step 0: Getting started with ellmer

First Steps

The first step to using ellmer is to make a new chat_* object. Since we are using OpenAI, we want to make a chat_openai object:1

library(ellmer)

first_chat <- chat_openai(model = "gpt-4.1-nano")

Before our chat object will actually work, though, we must make sure that ellmer knows about our OpenAI API key. There are a few ways to do this; in a pinch, you can explicitly tell ellmer what API key to use, right in your R code, like so:

first_chat <- chat_openai(model = "gpt-4.1-nano", api_key = "YOUR_OPENAI_KEY")

This is easy enough, but has a few problems:

  1. If you ever change your API key, or if you need to use a different key for a new project, you’ll need to change it in a ton of places in your code.
  2. If you ever share your code, you’ll also be giving away your API key; this could result in somebody else using it, and running up costs against your (or your workplace’s) account.

In general, things like API keys are best left out of your code. A better practice is to set them up as environment variables, by putting your API key in an .Renviron file somewhere. Posit has a helpful guide about how all of this works. For today’s activity, copy and paste the following into your .Renviron file:

OPENAI_API_KEY=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
Warning

This is absolutely not something you should ever do with your own API keys! They should be kept secret, like a password. It is OK in this specific scenario, though, since the API key is going to be deactivated after this session.

After adding the key to your .Renviron file, you’ll need to restart your R session before continuing.
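If you want to double-check that R picked up the key after restarting, one optional sanity check is to ask whether the environment variable is visible to R (without printing the key itself to the console):

# Optional check: TRUE means R can see an OPENAI_API_KEY value
nchar(Sys.getenv("OPENAI_API_KEY")) > 0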

Once you’ve set that up and created your chat object, we can use it to query our LLM:

first_chat$chat("What is the capital of Wales?")
The capital of Wales is Cardiff.
Important

This example is actually a questionable use case for an LLM, as it depends on the model having “world knowledge” beyond the contents of the query itself. In other words, the model has to actually know information about regional capitals; during the lecture, we discussed at length why this assumption is questionable.

In practice, expect LLMs to have highly variable and irregular performance at this sort of query, especially for inputs that go “off the beaten path”. For regions and capitals that are frequently mentioned in the English-language corpora on which LLMs were trained, we can probably expect a reasonable answer; for even slightly more obscure regions, and in non-English languages, the behavior could become less reliable.

Question: What do you think would happen if the LLM were queried about a region with a disputed capital?

Something to know about chat objects is that they are “stateful”, meaning that each one keeps around its entire history. To demonstrate this, we can ask our chat object about our last question:

first_chat$chat("What was the last question I asked?")
The last question you asked was, "What is the capital of Wales?"

This can be very useful, but is often not ideal. You also may have noticed that the output is being printed to the console, which is fine for interactive use… but normally, we want to access the LLM programmatically, meaning we want it to be driven by our program and our data. For this sort of use, a more common pattern is to wrap calls to the LLM in a function, like so:

library(glue) # for glue() string interpolation in our prompts

capital_city <- function(some_region) {
  c <- chat_openai(model = "gpt-4.1-nano")
  prompt <- glue("What is the capital of {some_region}?")
  c$chat(prompt, echo=FALSE) # no need to print, we're in a function
}

And then, we can call it like any other R function:

capital_of_wales <- capital_city("Wales")
capital_of_czechia <- capital_city("Czechia")

We can combine this with other R routines and workflows easily:

library(tidyverse) # for tibble(), mutate(), and map_chr()

# set up a toy dataset
regions_df <- tibble(region=c("Wales","Czechia","Idaho"), category=c("Country","Country","State"))

# Map our capital_city() function across each row
regions_df <- regions_df |> mutate(capital=map_chr(region, capital_city))

# View the result
regions_df
region   category   capital
Wales    Country    The capital of Wales is Cardiff.
Czechia  Country    The capital of Czechia (the Czech Republic) is Prague.
Idaho    State      The capital of Idaho is Boise.
Tip

Today, we are focusing on programmatic use of ellmer, but it does come with a couple of functions that can be used to set up an “interactive” conversation loop, right in your R session. Check out the live_console and live_browser functions.
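For example, assuming the first_chat object from earlier is still around, either of these will drop you into an interactive loop (a quick sketch; we will not need these today):

# Interactive chat using an existing chat object
live_console(first_chat)  # chat right in the R console
live_browser(first_chat)  # chat in a small browser-based interface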

Improving Our Prompt

You may notice that instead of actual capital cities, we have ended up with answers to our question in the form of a complete sentence. This is an example of LLMs being extremely literal: we asked a question, and the model responded with an answer. Remember: by default, the LLM is trying to make output that looks statistically plausible according to its training data, and in its training data, it saw a lot of examples of narrative text written in complete sentences!

We’ve got a couple of different options for how to deal with this. First, we could adjust our prompt to more clearly describe what we want:

capital_city_revised <- function(some_region) {
  c <- chat_openai(model = "gpt-4.1-nano")
  prompt <- glue("What is the capital of {some_region}? Answer only with the name of the capital city; do not include any additional commentary.")
  c$chat(prompt, echo=FALSE)
}

Let’s try our code from above and see if that helped:

regions_df <- tibble(region=c("Wales","Czechia","Idaho"), category=c("Country","Country","State"))
regions_df <- regions_df |> mutate(capital=map_chr(region, capital_city_revised))
regions_df
region   category   capital
Wales    Country    Cardiff
Czechia  Country    Prague
Idaho    State      Boise

Another option would be to provide the LLM with a system prompt. This is a special prefix that is meant to give overarching directions to the model, controlling its behavior independently of any particular query. Different LLM providers have different default system prompts, designed for different purposes and with different goals.

Let’s see how we might have solved our above issue of overly-wordy output by specifying a system prompt:

capital_city_sys_prompt <- function(some_region) {
  c <- chat_openai(model = "gpt-4.1-nano", 
                   system_prompt = "In your answers, be as terse as possible; do not include any additional language or commentary beyond the information requested." )
  
  prompt <- glue("What is the capital of {some_region}?")
  c$chat(prompt, echo=FALSE)
}
regions_df <- tibble(region=c("Wales","Czechia","Idaho"), category=c("Country","Country","State"))
regions_df <- regions_df |> mutate(capital=map_chr(region, capital_city_sys_prompt))
regions_df
region   category   capital
Wales    Country    Cardiff
Czechia  Country    Prague
Idaho    State      Boise
Note

For our examples thus far, our interactions have only involved a single back-and-forth turn with the LLM, so at a practical level there is not much difference between making our primary prompt more elaborate and adding a system prompt. For more complex workflows, the system prompt becomes more important.

System prompts do not have to have anything to do with the actual content of the queries themselves! As a very silly example, consider the following:

capital_city_sys_silly <- function(some_region) {
  c <- chat_openai(model = "gpt-4.1-nano", 
                   system_prompt = "All answers must be given in the form of a limerick, and also must somehow mention a regionally-appropriate cheese." )
  
  prompt <- glue("What is the capital of {some_region}?")
  c$chat(prompt, echo=FALSE)
}

library(gt) # for nicely-formatted table output

regions_df <- tibble(region=c("Wales","Czechia","Idaho"), category=c("Country","Country","State"))
regions_df <- regions_df |> mutate(capital=map_chr(region, capital_city_sys_silly))
regions_df |> gt()
region   category   capital
Wales    Country    In Cardiff, where seagulls do squeal, The capital by the sea, With Caerphilly near, And Welsh pride held dear, It's Cardiff, where heart's always free!
Czechia  Country    In Prague where tales do unfold, A city both lively and old, With a cheese so divine, Like Hermelín, fine, Its stories are ever retold.
Idaho    State      In Boise, where mountains are high, The capital does rest in the sky, With cheese from Sun Valley’s fame, It's where governors acclaim, Idaho’s heart beats a proud lullaby.

In all of these cases, the model was able to “understand” what we were asking of it; in practice, most real-world problems require a bit more iteration. For more complicated inputs, or more complicated prompts, this is an area where you may notice a difference between a small model such as gpt-4.1-nano and a larger model: the larger models will generally be able to handle more complex and detailed instructions.2

Note

It is very common for system prompts to include specific instructions about the desired output format. Some common scenarios:

  1. Telling the model that it must answer with either “true” or “false”
  2. Instructing the model that its answer must be a number between 1 and 100
  3. Telling the model that its answer must be from a closed set of options (“small”/“medium”/“large”; “relevant”/“not relevant”, etc.)

Tip: think about what kind of column you want the output to be in your final dataframe!
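To make these options concrete, here is a minimal sketch of a screening-style helper; the function name, prompt wording, and answer set are hypothetical and just for illustration, not part of the workshop materials:

# A sketch only: constrain the model to a closed set of answers, then convert
# the result to a logical value that fits nicely in a dataframe column.
is_relevant <- function(abstract_text) {
  c <- chat_openai(
    model = "gpt-4.1-nano",
    system_prompt = "Answer only with 'relevant' or 'not relevant'; do not include any other text."
  )
  prompt <- glue("Is the following abstract relevant to a review of Schistosomiasis clinical trials?\n\n{abstract_text}")
  answer <- c$chat(prompt, echo=FALSE)
  trimws(tolower(answer)) == "relevant"
}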

Some Pesky Details

Deterministic Output

Before we move on to our lit review, there are two more minor details to cover.

First, recall that LLMs generate their output using a stochastic approach: they are literally “sampling” from a probability distribution over possible next words. This means that, if we give the LLM the same input multiple times, we might get different output:

writeLines(capital_city_sys_silly("Oregon"))
In regions where forests are green,  
Salem stands proud and serene,  
With Blue Cheese in sight,  
It’s Oregon's bright light,  
Capital where politicians convene.
writeLines(capital_city_sys_silly("Oregon"))
In Salem, where the Willamette gleams,  
Oregon's capital, or so it seems,  
With a hint of Tillamook,  
And a quaint little nook,  
It's Oregon's state capital dreams!

In scientific settings, this can cause massive headaches, as it will make getting repeatable output almost impossible. Unfortunately, this problem is not 100% solvable when one is using an external LLM provider such as OpenAI; however, we can get 99% of the way there by specifying a seed to the API. This is analogous to using R’s set.seed() function before doing bootstrapping or some other statistical process that you want to be repeatable. Let’s modify our function to take this into account:

capital_city_sys_silly_seed <- function(some_region) {
  c <- chat_openai(model = "gpt-4.1-nano", 
                   system_prompt = "All answers must be given in the form of a limerick, and also must somehow mention a regionally-appropriate cheese.", 
                  params=params(seed = 42))
  
  prompt <- glue("What is the capital of {some_region}?")
  c$chat(prompt, echo=FALSE)
}

Now, if we invoke the function twice, we should get repeatable output:

writeLines(capital_city_sys_silly_seed("Oregon"))
In Salem, the capital's place,  
Oregon's proud and its grace,  
With Oregon cheese in the air,  
It's where policies dare,  
And government finds its own space!
writeLines(capital_city_sys_silly_seed("Oregon"))
In Salem, the capital's place,  
Oregon's proud and its grace,  
With Oregon cheese fine,  
It stands in a line,  
A city's sweet, steady pace.

Unfortunately, this is not a perfect solution; particularly for longer outputs, or interactions that span multiple turns before yielding the final answer, you may find that it is not 100% reliable. If you find yourself in such a situation, you may need to consider an alternative method of interacting with an LLM.
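A related setting worth knowing about, though we will not explore it further today, is the sampling temperature: lowering it to 0 makes the model’s word choices as greedy as possible, which, together with a seed, can further reduce (but still not eliminate) run-to-run variation. A minimal sketch, using the same model as above:

# Sketch only: a fixed seed plus temperature = 0 to further reduce variation;
# this still does not guarantee perfectly repeatable output.
c <- chat_openai(model = "gpt-4.1-nano",
                 params = params(seed = 42, temperature = 0))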

Dealing With More Data

If you’ve been following along, you may have noticed that each of our invocations of the LLM took a little while to run; if we had a lot of rows in our dataframe, we might not want to wait so long. ellmer has several methods for dealing with this; the easiest to use is parallel_chat. This function takes a list of prompts, runs them all in parallel (respecting limits on how many queries your account can run at once), and then gives back the results; it is a bit clunky to work with, because of how it provides the output. The saving grace here is that there is a second, and much cleaner, way to use parallel_chat, which we will meet in the second section of this workshop.

capital_city_parallel <- function(cities) {
  as_prompts <- as.list(glue("What is the capital of {cities}?"))
  
  c <- chat_openai(model = "gpt-4.1-nano", 
                   system_prompt = "In your answers, be as terse as possible-; do not include any additional language or commentary beyond the information requested.",
                   params=params(seed=42))
  
  results <- parallel_chat(c, as_prompts)

  # This part is admittedly kind of ugly...
  return(map_chr(results, ~.$last_turn()@text))
}

regions_df <- tibble(region=c("Wales","Czechia","Idaho"), category=c("Country","Country","State"))

regions_df |> mutate(capital=capital_city_parallel(region))
region   category   capital
Wales    Country    Cardiff
Czechia  Country    Prague
Idaho    State      Boise

Now that we’re a little bit familiar with ellmer and have done a bit of prompt iteration, let’s start on our lit review!

Some Gory and Ignorable Details

What’s going on there at the end of capital_city_parallel()? The short version is that parallel_chat() returns a list of Chat objects, which are somewhat complex data structures that contain an entire chat history as well as a lot of other metadata that we don’t care about for this specific function.

The snippet inside the map_chr() call visits each one, requests its most recent turn, extracts that turn’s text content, and combines the results into a character vector. This pattern is an example of a “functional” style of programming; if you are not familiar with it, it may look a little strange, but it turns out to be a very efficient way of working with lists of things. If you are interested, the map() documentation has a lot more information about this way of thinking.
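If the formula shorthand looks cryptic, the same extraction can be written with an explicit anonymous function; this is equivalent to the snippet inside capital_city_parallel() above:

# Equivalent to map_chr(results, ~.$last_turn()@text), written out in full
map_chr(results, function(chat_obj) {
  last <- chat_obj$last_turn()  # the most recent turn in this Chat's history
  last@text                     # the text content of that turn
})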

Footnotes

  1. There are a variety of ways we can customize our chat object; the documentation has more information and examples.↩︎

  2. And produce more faithful output; gpt-4.1-nano’s output does not always quite “scan” as a limerick.↩︎