Homework 3: Off-the-Shelf Tools

Introduction

There are many off-the-shelf IR tools available, many of which are extremely powerful. Often, when you need IR (either as part of a research project or as a component in another application), it's best to start with an already-existing tool rather than building your own completely from scratch. The point of this assignment is to give you hands-on experience with using and customizing existing software, so be prepared for some digging through API documentation.

In this assignment, you will be applying IR to a real-world research problem that I am currently working on with a colleague. You may be aware that, in the US, many categories of medical research projects are required to register themselves on clinicaltrials.gov. Many NIH-funded research projects fall into this category, and the rules have recently expanded to require registration from more types of study than was previously the case (see here for the gory details). The main NIH database of grants, the NIH Reporter, is supposed to contain a link between a grant's record and its clinicaltrials.gov entry (see here for a totally random example).

Many research projects that should register do not. Registration is an important component of our country's regulatory framework for medical research, and studies that go unregistered can be (and often are) essentially unsupervised and unregulated. We are only just beginning to attempt to quantify the magnitude of the problem, but the idea is ultimately to use computational methods to automatically identify grants that ought to register. For this assignment, we will explore the use of IR systems to identify randomized controlled trials (RCTs). Not all RCTs must register (animal studies, for example, are exempt, as are studies that take place entirely outside the US), and not all studies that must register are RCTs, but the vast majority of RCTs ought to have a clinicaltrials.gov entry, so it is a reasonable place to start. (Computational identification of study design is an important biomedical NLP and IR problem in its own right; see Cohen et al. 2015 for a good overview.)

I have assembled a data set for you to work with, consisting of all R01 grants awarded by the NIH from 2009-2013, along with their clinicaltrials.gov registration status. The file consists of just under 20,000 grants, each of which is on its own line, represented as a JSON object. The format is relatively simple: each grant has several fields, including a title, an abstract, a set of metadata terms, and so on. Notably, some grants also include a list of known clinicaltrials.gov registrations.
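To give a flavor of the format, a record might look roughly like the following; note that the field names here are invented for illustration, and you should check the actual file for the real ones:

    {"title": "A randomized trial of ...",
     "abstract": "We propose to enroll ...",
     "terms": ["Humans", "Clinical Research"],
     "registrations": ["NCT00000000"]}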

We'll be working with two different IR systems in this assignment: Apache Lucene and Elasticsearch. In some ways, it's more like 1.5 systems, since Elasticsearch is built on top of Lucene, but Elasticsearch adds many features of its own and is useful in different ways.

For each of the systems, you will need to figure out how to issue the following queries (searching over both the title and the abstract); a rough sketch of how these might be constructed appears after the list:

  1. A simple term query, looking for instances of "RCT"
  2. A simple phrase query, looking for "randomized controlled trial"
  3. A boolean query, looking for either "RCT" OR "randomized controlled trial"
  4. A wildcard query, looking for terms that begin with the prefix "random"
  5. A fuzzy positional query that will catch both "randomized controlled trial" and "randomized controlled clinical trial"
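
Here is a hedged sketch of how these five queries might be built with Lucene's query classes, over a single hypothetical "title" field; in your actual program you would repeat each for the abstract (or go through a query parser, which has the advantage of analyzing the query text the same way the documents were analyzed):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    // 1. Simple term query. Analyzers lowercase tokens at index time,
    // and raw query objects bypass analysis, so search for "rct".
    Query q1 = new TermQuery(new Term("title", "rct"));

    // 2. Simple phrase query.
    Query q2 = new PhraseQuery("title", "randomized", "controlled", "trial");

    // 3. Boolean OR of the two.
    Query q3 = new BooleanQuery.Builder()
            .add(q1, BooleanClause.Occur.SHOULD)
            .add(q2, BooleanClause.Occur.SHOULD)
            .build();

    // 4. Prefix query for terms beginning with "random".
    Query q4 = new PrefixQuery(new Term("title", "random"));

    // 5. Sloppy phrase query: a slop of 1 lets one extra word slip
    // in, so "randomized controlled clinical trial" also matches.
    Query q5 = new PhraseQuery(1, "title", "randomized", "controlled", "trial");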

Additionally, you will review each system's query syntax capabilities, and generate a new query of your own devising that you think will maximize the number of RCTs retrieved (and minimize the number of false positives, i.e., grants that do not describe RCTs). To do this, you will need to spend some time looking at the search results returned by the earlier queries.

For each of the queries, you will create a 2x2 table. The rows will represent documents that were or were not retrieved by the query, and the columns will represent grants that did or did not register with clinicaltrials.gov. In the "real world", we would use these tables to help guide the next steps of the project: for example, to identify registered grants that our queries missed, or retrieved grants that have no clinicaltrials.gov registration.

The latter being, of course, what we're looking for. Our next steps might be to train a more specific text classification model, and so forth.
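
For concreteness, each table might be laid out as follows, where a through d are grant counts; cell b (retrieved, but not registered) is where the candidates of interest would land:

                       Registered    Not registered
    Retrieved              a               b
    Not retrieved          c               d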

Part 1: Raw Lucene

Lucene is an open-source IR toolkit written in Java. It is extremely powerful and flexible, and is also very well documented. I have written a very simple Lucene program that builds the most basic possible index of our grant records, and that demonstrates how to run very basic queries.

Your task will be to take my simple program, and modify it in the following ways:

  1. My program uses the default StandardAnalyzer, which only does case folding and English stopword removal. It does no stemming, meaning (for example) that "trial" and "trials" would not match. Lucene comes with several more sophisticated analyzers, including the EnglishAnalyzer. Like the StandardAnalyzer, the EnglishAnalyzer does case folding and so on, but it also does stemming, handles possessives better, and so forth. Figure out how to use it in place of the StandardAnalyzer.
  2. Currently, my program only indexes titles. Extend it to index abstracts as well as metadata terms. This will require a little bit of monkeying around in the data, in order to figure out how these other fields are being stored in the JSON records.
  3. What the Manning book calls weighted zone scoring, Lucene calls field-level boosting. Set up the program to give the title field twice the weight of the abstract field. Hint: look in the documentation for org.apache.lucene.document.Field for ideas on how to do this!
  4. Currently, my program uses a very basic query parser, org.apache.lucene.queryparser.classic.QueryParser, which (while powerful) is somewhat finicky in terms of its syntax, and not very user-friendly. Lucene provides a much more advanced and flexible query parser, org.apache.lucene.queryparser.simple.SimpleQueryParser. Figure out how to swap SimpleQueryParser into the program in place of plain ol' QueryParser, and play around with its capabilities.
Believe it or not, most of these are very minor changes; that's a big part of Lucene's appeal! A rough sketch of the relevant pieces follows.
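
This sketch is a hedged illustration of the four modifications, not a drop-in replacement for the starter code: the class structure and field names are assumptions, and note that index-time boosting via Field.setBoost() exists in Lucene 6.x and earlier but was removed in later versions, where you would boost at query time instead.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.queryparser.simple.SimpleQueryParser;
    import org.apache.lucene.search.Query;

    public class LuceneSketch {
        // Modification 1: EnglishAnalyzer (stemming, possessive
        // handling, etc.) is a drop-in replacement for StandardAnalyzer.
        static final EnglishAnalyzer ANALYZER = new EnglishAnalyzer();

        // Modifications 2 and 3: index the abstract alongside the
        // title, and give the title twice the weight.
        static Document makeDocument(String title, String abstractText) {
            Document doc = new Document();
            TextField titleField = new TextField("title", title, Field.Store.YES);
            titleField.setBoost(2.0f); // index-time boost (Lucene <= 6.x)
            doc.add(titleField);
            doc.add(new TextField("abstract", abstractText, Field.Store.YES));
            return doc;
        }

        // Modification 4: SimpleQueryParser in place of the classic
        // QueryParser. This constructor takes per-field weights, which
        // is another route to title boosting, this time at query time.
        static Query parseQuery(String queryText) {
            Map<String, Float> weights = new HashMap<>();
            weights.put("title", 2.0f);
            weights.put("abstract", 1.0f);
            return new SimpleQueryParser(ANALYZER, weights).parse(queryText);
        }
    }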

Here is the simple program. When you extract it, you will find a .java file containing the program itself, a directory containing the necessary .jar libraries, and a shell script that will compile and run the program. The shell script takes three arguments:

  1. should_clear_index: whether to rebuild the index (1), or to use an existing one (0).
  2. path_to_data: the path to the data file containing grants to index (e.g. grants.first100.json).
  3. path_to_index: a path where you would like Lucene to save its index (or the path of the existing index you would like to search, if should_clear_index is set to 0).
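
For example, a first run that builds a fresh index over the sample file might look like the following (run.sh here is a stand-in; the actual script in the download may be named differently):

    ./run.sh 1 grants.first100.json ./grant-index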

Now that you've got a modified search system, use it to issue the queries specified in the introduction, and produce the 2x2 tables.

Note that Lucene is written in Java; if you are unfamiliar with Java, this part of the assignment may be challenging. Feel free to come talk to me if you need help getting started!

What to turn in: your commented code, as well as a paragraph discussing your experience working with Lucene. What effect, if any, did the zone weighting have on the search results for the queries? What happened to the result ranking if you weighted the abstracts more heavily than the titles?

Part 2: Elasticsearch

Elasticsearch is an open-source IR system built on top of Lucene. It is designed for high-performance search over large indices, and includes many complex features that allow for replication across servers and so forth. One interacts with Elasticsearch via an HTTP-based API, and there are client libraries written for many different languages. In addition to being an IR system, Elasticsearch has evolved into a general-purpose document-oriented database, and includes powerful features for range querying, aggregation, etc.

Elasticsearch is a bit of a bear to work with, but is useful enough to be worth becoming familiar with. For this part of the assignment, I have provided a very minimal example of how to create and query a new index using the Python client library, but you will certainly need to go beyond what's in that file to complete the assignment, and you should feel free to use whatever language you like. This part of the assignment has the following steps:

  1. Install Elasticsearch, and read through the "Getting Started" part of the manual. (Note: if you are running Mac OS X and using Homebrew, you can install Elasticsearch by running "brew install elasticsearch")
  2. By default, Elasticsearch builds indices with something very similar to Lucene's StandardAnalyzer, meaning that stemming, etc. is not performed on documents as they are indexed. Elasticsearch comes with many different built-in analyzers, and it is possible to configure your index to use a different one by default (it is also possible to define per-field analyzers, e.g. so you can have one field set up with French stop-word removal and another with English). Figure out how to set up your Elasticsearch index to use the English-language stopword and stemming rules (i.e., the English language analyzer).
  3. Experiment with field-level boosting at query time, as in the Lucene example.
  4. Elasticsearch has many types of queries, including regular expression matching, simple term-field matching, multi-field matching, etc., as well as some simpler methods for searching. Read about the query system, and experiment with some of the different query forms.
  5. One of the most useful things about Elasticsearch is its highlighting functionality: it is very easy to get it to give you snippets of text showing where your query matched a result (it is essentially a nice API on top of Lucene's very similar functionality, but one must never underestimate the value of a good API!). Extend your code to produce, in addition to 2x2 tables, a simple formatted HTML report showing, for each cell, the context in which the query occurred in a match. (Hint: if you're using the Python elasticsearch-dsl library, look here.) A minimal sketch covering steps 2, 3, and 5 appears after this list.
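
The sketch below uses the low-level Python elasticsearch client; the index name, field names, and overall structure are assumptions rather than the contents of my example file, so treat it as a starting point only:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Step 2: make the built-in "english" analyzer (stemming plus
    # English stopwords) the default for every text field in the index.
    es.indices.create(index="grants", body={
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {"type": "english"}
                }
            }
        }
    })

    # Steps 3 and 5: a multi_match query with query-time field boosting
    # ("title^2" gives the title twice the weight of the abstract), plus
    # a highlight clause asking for snippets showing where the query hit.
    results = es.search(index="grants", body={
        "query": {
            "multi_match": {
                "query": "randomized controlled trial",
                "fields": ["title^2", "abstract"]
            }
        },
        "highlight": {
            "fields": {"title": {}, "abstract": {}}
        }
    })

    for hit in results["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("title"))
        for field, snippets in hit.get("highlight", {}).items():
            for snippet in snippets:
                print("  ", field, ":", snippet)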

What to turn in: your commented code, your formatted report, as well as a paragraph discussing your experience working with Elasticsearch. Was it easier to use than Lucene? Harder? Were there any features that you found particularly useful for these queries? What about features that were missing?


Turn in your assignment via email, with "IR HW3" in the subject line. Assignments are due Tuesday, May 23rd, by 11:59 PM.