Homework 2

This homework will have several parts.

Part 1: Inverted Index Construction

Here is a tab-delimited file containing a very small set of tweets. Each line is a document; the first field is the document ID, and the second is a tweet. You will write a program in the language of your choice to build an inverted index of these documents. You may use the approach described in the Manning book, or you may come up with your own algorithm.

What to turn in: your commented code, as well as print-out of the postings for all terms beginning with the letter "a".

Part 2: Merge algorithm

Implement the merge algorithm for intersecting the postings of two terms, as well as code to use it to process simple Boolean queries. When there are multiple query terms, make sure that your algorithm uses the optimization described in Manning, et al. Chapter 1 of performing the most restrictive intersection first. (Note that this will require your index to compute and store certain corpus statistics.)

Using your algorithm and the index you built in part 1 of the Twitter data, process the following queries:

egg AND cheese
chocolate AND strawberry
eggs AND cheese AND bacon

What to turn in: your commented code, as well as the list of document IDs matching each query.

Part 3: Scoring

Extend your system from part 2 to perform simple TF-IDF scoring of the retrieved results. For this data set, there is no need to worry about any weight normalizations.

What to turn in: your commented code, the sorted list of document IDs matching each of the queries from part 2, and a sentence or two about whether you think the sorting makes sense in this case.

Part 4: Larger data, and zone indexing

In this file, you will find a file containing data from approximately 100,000 PubMed Central articles. Each line of the file is an article, and each article is represented as a JSON object. Each article has a title, an abstract, a unique identifier (its PubMed ID, or PMID), and bibliographic metadata (a list of authors, journal publication info, etc.).

For this part of your assignment, run your algorithm on this larger data set, indexing both titles as well as abstracts, but doing so using the zone indexing technique described in Chapter 6 of Manning, et al., and again performing simple TF-IDF scoring. You may choose for yourself whether to encode zones in the term list or in the posting lists. Extend your merge algorithm to allow users to specify whether they want a term to appear in the title or the abstract.

Process the following simple Boolean queries:

title:monkey AND abstract:development
title:schistosoma
abstract:cholera AND title:risk
title:mosquito AND abstract:filariasis

Next, add an option to your system to perform weighted zone scoring, with a configurable weight given to the fields. Produce a set of ranked results for each query with the title field's weight set to twice that of the abstract field.

What to turn in: Your commented code, and, for each fo the four queries above, the number of document IDs (PMIDs, in this case) that your system found for each query, and the two ranked lists of results (with and without weighted zone scores). Also include a few sentences addressing whether there were notable differences in the ranking of the results between the weighted and unweighted conditions. Which set of results do you think is better? Also include few sentences describing your approach and anything interesting you found along the way.

Turn in your assignment via email, with "IR HW2" in the title. Assignments are due April 25th.

Back to syllabus