Homework 2

This homework will have several parts.

Part 1: Inverted Index Construction

Here is a tab-delimited file containing a very small set of tweets. Each line is a document; the first field is the document ID, and the second is a tweet. You will write a program in the language of your choice to build an inverted index of these documents. You may use the approach described in the Manning book, or you may come up with your own algorithm.
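If you are unsure where to start, a minimal sketch in Python might look like the following. It assumes the simplest possible normalization (lowercasing and whitespace splitting), which you are free to improve on:

```python
from collections import defaultdict

def build_inverted_index(lines):
    """Map each term to a sorted list of the IDs of documents containing it.

    `lines` is an iterable of tab-delimited records: doc ID, then tweet text.
    """
    index = defaultdict(set)
    for line in lines:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        # Naive normalization (an assumption): lowercase, split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sorted postings let the merge algorithm in Part 2 assume ordered lists.
    return {term: sorted(ids) for term, ids in index.items()}
```

Using a set per term keeps duplicate postings out when a term occurs twice in one tweet; sorting at the end gives you the ordered postings lists that the Part 2 merge relies on.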

What to turn in: your commented code, as well as a printout of the postings for all terms beginning with the letter "a".

Part 2: Merge algorithm

Implement the merge algorithm for intersecting the postings of two terms, as well as code to use it to process simple Boolean queries. When there are multiple query terms, make sure that your algorithm uses the optimization described in Chapter 1 of Manning et al.: perform the most restrictive intersections first.

Using your algorithm and the index of the Twitter data that you built in Part 1, process the following queries:

What to turn in: your commented code, as well as the list of document IDs matching each query.

Part 3: Scoring

Extend your system from part 2 to perform simple TF-IDF scoring of the retrieved results. For this data set, there is no need to worry about any weight normalizations.
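As a rough sketch of what "simple TF-IDF scoring" could look like here (raw term frequency times idf = log10(N/df), no normalization, as permitted above; the function signature is just one possible design):

```python
import math
from collections import Counter

def tfidf_scores(query_terms, matching_ids, doc_text, n_docs, index):
    """Rank the documents matching a Boolean query by summed tf-idf.

    `doc_text` maps doc ID -> raw text; `index` maps term -> postings list.
    Uses raw tf and idf = log10(N / df), with no length normalization.
    """
    scores = {}
    for doc_id in matching_ids:
        tf = Counter(doc_text[doc_id].lower().split())
        scores[doc_id] = sum(
            tf[t] * math.log10(n_docs / len(index[t]))
            for t in query_terms if t in index
        )
    # Highest-scoring documents first.
    return sorted(scores, key=scores.get, reverse=True)
```

Notice that a term appearing in every document gets idf = 0 and contributes nothing to the score; that is worth keeping in mind when you write your sentence or two about whether the sorting makes sense on a collection this small.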

What to turn in: your commented code, the sorted list of document IDs matching each of the queries from part 2, and a sentence or two about whether you think the sorting makes sense in this case.

Part 4: Larger data, and zone indexing

This file contains data from approximately 100,000 PubMed Central articles. Each line of the file is an article, and each article is represented as a JSON object. Each article has a title, an abstract, a unique identifier (its PubMed ID, or PMID), and bibliographic metadata (a list of authors, journal publication info, etc.).

For this part of your assignment, run your algorithm on this larger data set, indexing both titles and abstracts using the zone indexing technique described in Chapter 6 of Manning et al., and again performing simple TF-IDF scoring. You may choose for yourself whether to encode zones in the term list or in the posting lists. Extend your merge algorithm to allow users to specify whether they want a term to appear in the title or the abstract.
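Here is a sketch of the encode-zones-in-the-postings option, in Python. The JSON field names ("pmid", "title", "abstract") are assumptions; check the actual data before relying on them:

```python
import json
from collections import defaultdict

def build_zone_index(lines):
    """Index title and abstract zones, encoding the zone in each posting.

    Each input line is assumed to be a JSON object with "pmid", "title",
    and "abstract" fields (field names are a guess; verify against the data).
    Postings are (doc_id, zone) pairs, kept sorted.
    """
    index = defaultdict(set)
    for line in lines:
        art = json.loads(line)
        for zone in ("title", "abstract"):
            for term in art.get(zone, "").lower().split():
                index[term].add((art["pmid"], zone))
    return {t: sorted(ids) for t, ids in index.items()}

def postings_in_zone(index, term, zone=None):
    """Postings for a term, optionally restricted to one zone."""
    return sorted({doc for doc, z in index.get(term, ())
                   if zone is None or z == zone})
```

The alternative design, encoding zones in the term list (e.g. separate dictionary entries for `title.aspirin` and `abstract.aspirin`), trades a larger dictionary for smaller, simpler postings; either choice is acceptable, just describe your reasoning.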

Process the following simple Boolean queries:

What to turn in: Your commented code, the number of document IDs (PMIDs, in this case) that your system found for each query, along with the list of results, and a few sentences describing your approach and anything interesting you found along the way.

Part 5: Lucene

Lucene is an open-source IR toolkit written in Java. It is extremely powerful and flexible. It is also very well documented. I have written a very simple Lucene program that will do the most basic possible index of our PubMed Central articles, and that can handle very basic queries.

Your task will be to take my simple program, and modify it in the following ways:

  1. My program uses the default StandardAnalyzer, which only does case folding and English stopword removal. Lucene comes with a more sophisticated analyzer, the EnglishAnalyzer. Like the StandardAnalyzer, the EnglishAnalyzer does case folding and so on, but it also does stemming, handles possessives better, and so forth. Figure out how to use it in place of the StandardAnalyzer.
  2. Currently, my program only indexes titles. Extend it to index abstracts as well as author last names. This will require a little bit of monkeying around in the data, in order to figure out how the names are being stored.
  3. What the Manning book calls weighted zone scoring, Lucene calls field-level boosting. Set up the program to give the title field twice the weight of the abstract field. Hint: look in the documentation for org.apache.lucene.document.Field for ideas on how to do this!
  4. Currently, my program uses a very basic query parser, org.apache.lucene.queryparser.classic.QueryParser. Lucene provides a much more advanced and flexible query parser, org.apache.lucene.queryparser.simple.SimpleQueryParser. Figure out how to swap SimpleQueryParser into the program in place of plain ol' QueryParser, and play around with its capabilities.
Believe it or not, most of these are very minor changes; that is a big part of Lucene's appeal!

Here is the simple program. When you extract it, you will find a .java file containing the program itself, a directory containing the necessary .jar libraries, and a shell script that will compile and run the program. The shell script takes three arguments:

  1. should_clear_index: whether to rebuild the index (1), or to use an existing one (0).
  2. path_to_data: the path to the data file from Part 4 of this assignment (data.txt).
  3. path_to_index: a path where you would like Lucene to save its index (or the path of the existing index you would like to search, if should_clear_index is set to 0).

Run the queries from Part 4 on your new and improved Lucene program; you might need to adapt them to use the Lucene query syntax (try it without adapting and see what happens!). Run the queries both with and without the zone weighting.

Note that Lucene is written in Java; if you are unfamiliar with Java, this part of the assignment may be challenging. Feel free to come talk to me if you need help getting started!

What to turn in: your commented code, as well as a paragraph discussing your experience working with Lucene. What effect, if any, did the zone weighting have on the search results for the queries? What happened to the result ranking if you weighted the abstracts more heavily than the titles?


Turn in your assignment via email, with "IR HW2" in the subject line. Assignments are due April 16th.
