CS5/606: Problem Solving with Large Clusters

Homework 1

Handed out: March 31, 2014
Due: 23:59 on April 6, 2014

Description

The purpose of this homework is to introduce Hadoop, and practice with a couple of exercises. You'll be asked to turn in code. Make sure this code will work for other people besides you.

If you get stuck, ask me or one of your classmates for help. :-)

Preliminaries

You'll need to have a working login for the bigbirds; if you don't have one, let me know and I will get you set up.

Exercise 1

Implement a simple word-count program and run it with Hadoop over the English Gigaword corpus, which is installed on /data/eng_giga (on HDFS, not on the regular UNIX file system). The corpus is in 314 individual documents, all XML markup has been removed, except for the DOC ids, which have been retained in the form "<DOCID=XXXX>", punctuation has been split off, and standard NLP tokenization of contractions such as "can't" ("ca n't") has been performed.

While you are free to use native Java, or C++ with pipes, it will likely be easier (and about as efficient) to use streaming, which will mean that you can use any programming language you want. In that case, Python is highly recommended. If you really must use Java, I've excerpted a handy walkthrough about how to compile, package, and run Java programs for use with Hadoop.

But seriously, for this homework, you're much better off using the streaming API.

For streaming the mapper should output each key-value pair as a tab-delimited line. For example if you are doing this in Python, you might define:

def Output(key, value):
  print key + '\t' + value

For word count the key would be the word, the value the count. The mapper will collect the words and counts, and the reducer will combine them.

As discussed in Lin & Dyer, Chs. 2-3, there are a number of different approaches to implementing something like word count. The simplest is for the mapper to simply output

word  1

each time it encounters the word. The reducer will then total up the values for each key.

Another way is to have the mapper compute the totals for each word over the file that it is processing, and then output the total. These two approaches have implications for computational resources, so you should try both and see what the differences are. In particular compute the times for the simple versus the in-mapper aggregation versions over different sized subsets of the Gigaword corpus, and plot the results, similar to what Lin & Dyer have for the pairs and stripes approach in Figure 3.10.

Once you have your mapper and reducer working (you'll probably want to test them locally on a single file beforehand), you can write a script such as the following to submit the job on one of the bigbird machines:

#!/bin/bash
set -x

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u6.jar \
		-mapper `pwd`/mapper.py \
		-reducer `pwd`/reducer.py \
		-input /data/eng_giga \
		-output /user/$YOURNAME/eng_giga_word_counts \
		-numReduceTasks 10

What to turn in:

A file containing the 100 most frequent words and their counts
Your code.
A PDF with a plot showing the performance of the two word counting algorithms over different percentages of the Gigaword corpus.
A paragraph describing your approach, and what you found particularly interesting, surprising, difficult, fun, etc. about the problem.

For extra fun/credit:

If you're really motivated, look in

/user/steven/pubmed/data_json

. You'll find the pubmed corpus; click here for a sample of the files' contents. Note that each line is extremely long! The first tab-separated field is the article's PMID; the second is a JSON-formatted serialization of a MEDLINE record. Its structure is complex, but should be consistent across lines; here is a formatted example of what each record should look like.

Adapt your program to de-serialize the MEDLINE record and extract the article's abstract; note that not all articles have abstracts, so your code will have to handle this gracefully!

Perform a word count of MEDLINE, and compare the top N words with the top N words from Gigaword. Are they similar? Different?

Notes

Remember that each input file will get its own mapper, and that this mapper will process each line of the file in sequence.
You won't be able to view the Hadoop job tracker web page from off-campus; if you're really interested, I can show you how to get around this limitation.

Exercise 2

Write a map-reduce that extracts instances of ambiguous abbreviations in Gigaword, in context. The specific abbreviation to extract should be specified somewhere obvious in your code. A good example of an ambiguous abbreviation is "Dr", which could be either "Doctor" or "Drive". Collecting such examples in context is useful for, e.g., providing training data for a classifier that can disambiguate the two.

Taking "Dr" as the example, your mapper should find instances of "Dr", and output "Dr" in a window of words of some reasonable width -- say 4 words to the left and 4 words to the right. In case the abbreviation occurs at the beginning of the text, you should pad the left context with some easily identifiable padding token (e.g. "---"); similarly for the right context.

It is also desirable to filter contexts so that one does not have too many examples of similar contexts. Ideally one would like to classify contexts by some features that are suspected to be relevant for the ultimate classification task. In the case of "Doctor/Drive", one feature that is likely to be relevant are basic typographical features such as whether the word to the left is capitalized or lower-case, whether it is a number. One can then classify each token to the left and right as whether it starts with an upper ("U"), lower ("L"), digit ("N"), punctuation ("P") or else is a padding symbol ("-"). Thus a 4-word context "LLNU_PLUP" means the 4th word to the left is lower case, as is the 3rd word, the second word to the left is a number, the word to the left is upper case, and so forth.

These contexts can then become the keys, and the whole abbreviation-in-context string the values.

The reducer should accumulate the key/value pairs, allowing no more than some number (say 100) of any given context. It should also remove duplicate abbreviation-in-context strings.

Output from this map reduce might look something like this:

LLLL_LUUP       no period after the Dr in Dr Pepper ?
LLLL_PLLL       agree more with what Dr . el-Baz told you
LLLL_PLLL       generous offer to compensate Dr . deNatale and repurchase
LLLL_PLLL       thing that mattered to Dr . al-Hazmi was to
LLLL_PLLU       an incomplete version of Dr . van den Haag
LLLL_PLLU       invariably referred to as Dr . except by The
LLLL_PLLU       the period after the Dr . on a Dr
LLLL_PLPL       about this from some Dr . who 'd tell
LLLL_PLUL       embodies the symbolic form Dr . von Franz called
LLLL_PLUL       government agreed to accept Dr . al Masari because

As one might expect, examples in contexts that include "NU_" (number followed by an upper-case word, then "Dr") tend to be "Drive":

LLNU_PLLL       lease at 115 Dorsett Dr . was signed by
LLNU_PLLU       home at 2576 National Dr . in the Mill
LLNU_PLUP       home at 601 Riverview Dr . in Florence .
LLNU_PPUL       home at 929 Windermere Dr . " They said
LLNU_PPUP       home at 1000 Windermere Dr . " Oh ,
LLNU_PULL       amiss at 1531 Elm Dr . But until last
LLNU_PULL       job at 305 Riverside Dr . A relative of
LLNU_PUUP       street from 13 Calle Dr . Santiago Veve .

What to turn in:

Examples to demonstrate that this works, for at least two different abbreviations;
Your code;
Two paragraphs describing both your technical approach as well as what you found along the way. Did Hadoop, or your program, do something surprising? Etc.