For our literature review, we are interested in articles that describe clinical trials about pharmaceutical or surgical interventions designed to prevent or treat schistosomiasis. The results of the literature review are in the project folder, in a file called lit_review.xlsx. Take a minute to familiarize yourself with this file by opening it in Excel. What do you notice?
The actual search query we ran retrieved many results, most of which are relevant, but some of which are not. Some are simply not relevant to the topic at all, and are essentially “false positives”; others are on-topic, but are not relevant for the purposes of our study.
Note
Here is an example of such an article, describing an environmental intervention that removes habitat for the snails that serve as hosts for the Schistosoma helminths.
Note
Here is an example of a particularly blatant “false positive”; can you figure out how it ended up in our search results?
Tip
Hint: look at the author affiliations…
In a typical literature review process, one would go through each result, “by hand”, as part of an initial screening at the title and abstract level to determine inclusion status. Is this something that an LLM might be able to help us with (keeping in mind the limitations discussed above)? Let’s find out!
Data Ingestion and EDA
We begin, as always, with some simple exploratory data analysis. Remember, the “First Commandment” of data science is to Know Your Data.
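If you haven't already read the spreadsheet into R, something like the following will do the trick; this is just a sketch that assumes lit_review.xlsx sits at the top of your project folder (adjust the path if yours lives elsewhere) and uses the readxl and dplyr packages.

```r
# Read the search results in from Excel; adjust the path if your copy
# of lit_review.xlsx lives elsewhere in the project.
library(readxl)
library(dplyr)

lit_review <- read_excel("lit_review.xlsx")

# How many articles did we retrieve, and what columns do we have?
nrow(lit_review)
glimpse(lit_review)
```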
We see that we have 100 articles; you should see all the same columns here as were present in Excel. Let's see what we are working with in terms of how the data are distributed:
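The exact columns in your export may vary, so treat the column names below (`year`, `journal`) as placeholders; the point is simply to tabulate a couple of variables and see how the results are spread out.

```r
# Tabulate a couple of (assumed) columns to get a feel for the data;
# substitute whichever columns actually appear in your spreadsheet.
lit_review |>
  count(year, sort = TRUE)

lit_review |>
  count(journal, sort = TRUE) |>
  head(10)
```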
Are there other things you might look for during your initial EDA?
Classifying by Inclusion Status
Now let’s try to identify results in our initial review that are out of scope. Your goal will be to write a prompt that scans an abstract from our review and determines whether or not it is in scope. Since we will use the result of this query to filter our literature review dataframe, we want the model to answer with only the word “true” or “false”, depending on whether the article is in scope; make sure to include something about this in your instructions.
Recall that an article is in scope if it…
Describes a clinical trial…
About a pharmaceutical or surgical intervention…
Designed to treat or prevent schistosomiasis.
Here is a scaffold of a function for you to start with; you’ll want to fill in the placeholder for the system prompt.
```r
is_in_scope <- function(some_abstract) {
  c <- chat_openai(
    model = "gpt-4.1-nano",
    params = params(seed = 42),
    system_prompt = "INSERT SYSTEM PROMPT HERE"
  )
  c$chat(some_abstract, echo = FALSE) # no need to print, we're in a function
}
```
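Before going further, it is worth spot-checking the function on a single abstract. Two caveats about this sketch: the `abstract` column name is an assumption (use whatever your dataframe actually calls it), and `c$chat()` returns a character string rather than a logical, so you will want to normalize the answer before using it as a filter.

```r
# Hypothetical spot check on the first abstract in the dataset;
# `abstract` is an assumed column name.
example_result <- is_in_scope(lit_review$abstract[[1]])
example_result

# The model answers in text, so coerce to a logical before filtering
tolower(trimws(example_result)) == "true"
```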
Tip
Start by working on a handful of examples from the lit review, not the entire dataset!
I recommend taking a random sample; if you go with this approach, make sure to call set.seed() in your R notebook, to ensure that your results are repeatable!
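One way to set up such a pilot sample (the seed value and the sample size of 10 are arbitrary choices):

```r
# Make the sample reproducible across runs of the notebook
set.seed(42)

# Draw a small pilot set to iterate on; 10 is an arbitrary size
pilot <- lit_review |>
  slice_sample(n = 10)
```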
Iteration and evaluation are critical to this workflow:
Make sure to look at your data: do you agree with the model’s assessment?
Check with your neighbor: how similar are your prompts? How different?
Once you are happy with your prompt’s behavior, apply your function to each search result and create a filtered version of our dataset (one way to approach this step is sketched after the questions below). Possible questions to explore at this point in your workflow:
How many articles did we end up excluding?
Review the excluded set of articles (there should not be too many of them); do you see any obvious false positives, i.e., articles our filter excluded that actually belong in the review?
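One possible way to carry out this step is sketched below; as before, `abstract` is an assumed column name, and because the classifier makes one API call per article, expect the full run over 100 abstracts to take a little while.

```r
library(purrr)

# Classify every abstract (one API call per article), then normalize
# the text answers into a logical column
lit_review_scored <- lit_review |>
  mutate(
    in_scope = map_chr(abstract, is_in_scope),
    in_scope = tolower(trimws(in_scope)) == "true"
  )

# Filtered dataset, plus the excluded set for manual review
lit_review_filtered <- lit_review_scored |> filter(in_scope)
excluded            <- lit_review_scored |> filter(!in_scope)

nrow(excluded) # how many articles did we end up excluding?
```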
Additional Bonus Exercise
Try two approaches:
Write a single prompt that captures all three dimensions of relevance
Write a separate and focused prompt for each dimension of relevance
For some problems, you may find it helpful to break things down into more discrete and concrete queries; other questions are better suited to longer and more complete prompts. In general, larger (and thus more expensive) models are better able to follow long prompts containing multiple instructions.
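As a rough sketch of the second approach (not a prescribed solution; the system prompts here are placeholders for you to write), you could build one focused classifier per dimension of relevance and then require all three to agree:

```r
library(ellmer) # provides chat_openai() and params(), as used in the scaffold above

# A small factory for focused classifiers; each gets its own system prompt,
# analogous to is_in_scope() above
make_classifier <- function(system_prompt) {
  function(some_abstract) {
    c <- chat_openai(
      model = "gpt-4.1-nano",
      params = params(seed = 42),
      system_prompt = system_prompt
    )
    tolower(trimws(c$chat(some_abstract, echo = FALSE))) == "true"
  }
}

is_clinical_trial     <- make_classifier("PROMPT ABOUT CLINICAL TRIALS HERE")
is_pharma_or_surgical <- make_classifier("PROMPT ABOUT THE INTERVENTION HERE")
is_about_schisto      <- make_classifier("PROMPT ABOUT SCHISTOSOMIASIS HERE")

# An article is in scope only if all three dimensions agree
in_scope_combined <- function(some_abstract) {
  is_clinical_trial(some_abstract) &&
    is_pharma_or_surgical(some_abstract) &&
    is_about_schisto(some_abstract)
}
```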
Moving On
Next, let’s see about using our LLM to help with data abstraction!