Basic Data Abstraction

Step 2: A first stab at data abstraction

Now that we have a set of articles that have passed initial screening, the next phase of a literature review often involves some data abstraction. This might be to narrow the focus of our review further (perhaps we are only interested in studies involving certain populations) or simply to better understand the literature landscape (which countries are represented?).

Warning

This kind of data abstraction can be very labor-intensive, and LLMs have the potential to be useful assistants for this task. That said, manually abstracting data, while tedious, is the best way to really get to know your literature. And, of course, LLMs are not perfect: they are likely to make mistakes and miss important information.

Finding the right balance between automation and manual effort is something that you will need to figure out by experimentation, and it may vary from project to project.

Here are some examples of information we might want to abstract from our articles. Pick one, and try to write a prompt that reliably extracts it from the articles in our dataset.

Geography: where did the study take place?
Study population: who were the participants included in the study?
Study design: what was the study’s experimental design (e.g. an RCT, a retrospective analysis, etc.)?
Something else entirely?

Things to consider:

What information from/about the citation will the model need? I.e., will the abstract alone be enough, or will it also need the title or journal?
Will each paper have a single possible answer, or will there be multiple possible answers?
- If there are multiple possible answers, what do you want the model to do in that case?
What format do you want the output to take? A specific word or phrase, a delimited list of words/phrases, a narrative sentence, a number?
What do you want the model to do if the requested information is not present in the input?
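To make these considerations concrete, here is a minimal sketch of one way to encode them for the geography question. It is written in plain Python purely for illustration, and the same logic carries over to whatever language you are using to talk to the model. Everything named here is an assumption made for the example rather than part of the workshop materials: the `NOT REPORTED` sentinel, the semicolon delimiter, and the helper names `build_prompt` and `parse_reply`. No LLM is called; the sketch only builds the prompt text and tidies up a reply.

```python
# Hypothetical sentinel the model is asked to return when the information is absent.
MISSING = "NOT REPORTED"

# Hypothetical prompt wording; adjust it to the item you chose to abstract.
PROMPT_TEMPLATE = """You are assisting with data abstraction for a literature review.
From the title and abstract below, identify the country or countries where the study took place.

Rules:
- Answer with country names only, separated by semicolons if there are several.
- If the location is not stated, answer exactly: {missing}
- Do not add any explanation.

Title: {title}
Abstract: {abstract}"""


def build_prompt(title: str, abstract: str) -> str:
    """Fill the template with one citation's title and abstract."""
    return PROMPT_TEMPLATE.format(
        missing=MISSING, title=title.strip(), abstract=abstract.strip()
    )


def parse_reply(reply: str) -> list[str]:
    """Turn the model's raw reply into a list of countries ([] if not reported)."""
    cleaned = reply.strip()
    if cleaned.upper() == MISSING:
        return []
    # Split on the agreed delimiter and drop empty fragments.
    return [part.strip() for part in cleaned.split(";") if part.strip()]
```

Spelling out the delimiter and the "not present" answer in the prompt itself is what makes the reply easy to parse afterwards. A narrative sentence would be harder to tabulate across many articles.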

As before, try this out on a small, random subset of the dataset before running it on more inputs, and make sure to look at your data: are the results correct?
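Drawing that subset could look something like the sketch below, again in plain Python for illustration only. It assumes the articles are held as a list of records; the function name `sample_articles` and the default subset size and seed are hypothetical choices. Fixing the seed means the same subset comes back each time, which helps when comparing one version of a prompt against another.

```python
import random


def sample_articles(articles: list[dict], n: int = 10, seed: int = 42) -> list[dict]:
    """Return up to n articles chosen at random, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    # Sample without replacement; cap n at the number of articles available.
    return rng.sample(articles, k=min(n, len(articles)))
```

Once the prompt has been run on the subset, read each result next to its abstract rather than trusting a summary count.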

Moving on?

In the unlikely event that we have more time, let’s move on to our stretch goal, and learn about a more efficient way to extract data using ellmer!