There are many off-the-shelf IR tools available, many of which are extremely powerful. Often, when you need IR (either as part of a research project or as a component in another application), it's best to start with an already-existing tool rather than building your own completely from scratch. The point of this assignment is to give you hands-on experience with using and customizing existing software, so be prepared for some digging through API documentation.
In this assignment, you will be applying IR to a real-world research problem that I am currently working on with a colleague. You may be aware that, in the US, many categories of medical research projects are required to register themselves on clinicaltrials.gov. Many NIH-funded research projects fall into this category, and the rules have recently expanded to require registration from more types of study than was previously the case (see here for the gory details). The main NIH database of grants, the NIH Reporter, is supposed to contain a link between a grant's record and its clinicaltrials.gov entry (see here for a totally random example).
Many research projects that should register do not. Registration is an important component of our country's regulatory framework for medical research, and studies that go un-registered can be (and often are) essentially unsupervised and unregulated. We are only just beginning to attempt to quantify the magnitude of the problem, but the idea is ultimately to use computational methods to automatically identify grants that ought to register. For this assignment, we will explore the use of IR systems to identify randomized controlled trials (RCTs). Not all RCTs must register (animal studies, for example, are exempt, as are studies that take place entirely outside the US), and not all studies that must register are RCTs, but the vast majority of RCTs ought to have a clinicaltrials.gov entry, so it is a reasonable place to start. (Computational identification of study design is an important biomedical NLP and IR problem in its own right; see Cohen, et al. 2015 for a good overview.)
I have assembled a data set for you to work with, consisting of all R01 grants awarded by the NIH from 2009-2013, along with their clinicaltrials.gov registration status. The file consists of just under 20,000 grants, each of which is on its own line, represented as a JSON object. The format is relatively simple: each grant has several fields, including a title, an abstract, a set of metadata terms, and so on. Notably, some grants also include a list of known clinicaltrials.gov registrations.
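To get a feel for the format, here is a minimal sketch of reading the file line-by-line and counting registered grants. The field names (`title`, `abstract`, `clinical_trials`) are assumptions for illustration; check the actual file for the real keys.

```python
import json

# Two made-up grant records in the one-JSON-object-per-line format;
# in practice you would iterate over open("grants.jsonl") instead.
sample = """{"title": "A randomized trial of X", "abstract": "...", "clinical_trials": ["NCT00000000"]}
{"title": "Basic biology of Y", "abstract": "..."}"""

total = 0
registered = 0
for line in sample.splitlines():
    grant = json.loads(line)            # each line is a complete JSON object
    total += 1
    if grant.get("clinical_trials"):    # hypothetical key: list of registrations
        registered += 1

print(total, registered)  # 2 1
```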
We'll be working with two different IR systems in this assignment: Apache Lucene and Elasticsearch. In some ways, it's more like 1.5 systems, since Elasticsearch is built on top of Lucene, but it has many different features and is useful in different ways.
For each of the systems, you will need to figure out how to issue the following queries (searching over both the title as well as the abstract):
Additionally, you will review each system's query syntax capabilities, and generate a new query of your own devising that you think will maximize the number of RCTs retrieved (and minimize the number of false positives, i.e., grants that do not describe RCTs). To do this, you will need to spend some time looking at the search results returned by the earlier queries.
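As a purely illustrative starting point (the terms here are made up, not a suggested answer), Lucene's query syntax supports field-scoped terms, phrase queries, boolean operators, and wildcards, so a hand-tuned query might look something like:

```
(title:"randomized controlled trial" OR abstract:randomi?ed) AND NOT abstract:mice
```

The `?` single-character wildcard matches both the American and British spellings ("randomized"/"randomised"); `NOT` clauses can trim obvious false positives such as animal studies.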
For each of the queries, you will create a 2x2 table. The rows will represent documents that were or were not retrieved by the query, and the columns will represent grants that did or did not register with clinicaltrials.gov. In the real world, we would use these to help guide the next steps of the project: for example, to identify
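The four cells of such a table fall out of simple set arithmetic over grant identifiers. A minimal sketch, with made-up grant IDs:

```python
# Rows of the 2x2 table: retrieved / not retrieved by the query.
# Columns: registered / not registered with clinicaltrials.gov.
all_grants = {"G1", "G2", "G3", "G4", "G5"}
retrieved  = {"G1", "G2", "G3"}   # hits returned for one query
registered = {"G1", "G4"}         # grants with a clinicaltrials.gov entry

a = len(retrieved & registered)               # retrieved and registered
b = len(retrieved - registered)               # retrieved, not registered
c = len(registered - retrieved)               # registered, not retrieved
d = len(all_grants - retrieved - registered)  # neither

print(a, b, c, d)  # 1 2 1 1
```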
Your task will be to take my simple program, and modify it in the following ways:
Here is the simple program. When you extract it, you will find a .java file containing the program itself, a directory containing the necessary .jar libraries, and a shell script that will compile and run the program. The shell script takes three arguments:
Now that you've got a modified search system, use it to issue the queries specified in the introduction, and produce the 2x2 tables.
Note that Lucene is written in Java; if you are unfamiliar with Java, this part of the assignment may be challenging. Feel free to come talk to me if you need help getting started!
What to turn in: your commented code, as well as a paragraph discussing your experience working with Lucene. What effect, if any, did the zone weighting have on the search results for the queries? What happened to the result ranking if you weighted the abstracts more heavily than the titles?
Elasticsearch is an open-source IR system built on top of Lucene. It is designed for high-performance search over large indices, and includes many complex features that allow for replication across servers and so forth. One interacts with Elasticsearch via an HTTP-based API, and there are client libraries written for many different languages. In addition to being an IR system, Elasticsearch has evolved into a general-purpose document-oriented database, and includes powerful features for range querying, aggregation, etc.
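Concretely, a query is just a JSON document sent over HTTP. The sketch below builds a `multi_match` query over the title and abstract zones, with the title boosted; the index name `grants` and the field names are assumptions, and the exact client call may vary with your Elasticsearch version.

```python
import json

# A multi_match query searching both zones, with the title weighted 2x.
# Field names and the "grants" index name are illustrative assumptions.
query = {
    "query": {
        "multi_match": {
            "query": "randomized controlled trial",
            "fields": ["title^2", "abstract"],  # "^2" boosts the title zone
        }
    }
}

# Against a running cluster, you would send this via the Python client, e.g.:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   results = es.search(index="grants", body=query)

print(json.dumps(query))
```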
Elasticsearch is a bit of a bear to work with, but is useful enough to be worth becoming familiar with. For this part of the assignment, I have provided a very minimal example of how to create and query a new index using the Python client library, but you will certainly need to go beyond what's in that file to complete the assignment, and you should feel free to use whatever language you like. This part of the assignment has the following steps:
What to turn in: your commented code, your formatted report, as well as a paragraph discussing your experience working with Elasticsearch. Was it easier to use than Lucene? Harder? Were there any features that you found particularly useful for these queries? What about features that were missing?