The purpose of this homework is to introduce Hadoop, and practice with a couple of exercises. You'll be asked to turn in code. Make sure this code will work for other people besides you.
If you get stuck, ask me or one of your classmates for help. :-)
Implement a simple word-count program and run it with Hadoop over the English Gigaword corpus, which is installed at /data/eng_giga (on HDFS, not on the regular UNIX file system). The corpus consists of 314 individual documents. All XML markup has been removed except for the DOC ids, which have been retained in the form "<DOCID=XXXX>"; punctuation has been split off; and standard NLP tokenization of contractions such as "can't" ("ca n't") has been performed.
While you are free to use native Java, or C++ with pipes, it will likely be easier (and about as efficient) to use streaming, which will mean that you can use any programming language you want. In that case, Python is highly recommended. If you really must use Java, I've excerpted a handy walkthrough about how to compile, package, and run Java programs for use with Hadoop.
But seriously, for this homework, you're much better off using the streaming API.
For streaming, the mapper should output each key-value pair as a tab-delimited line. For example, if you are doing this in Python, you might define:

def Output(key, value):
    print(key + '\t' + value)
For word count the key would be the word, the value the count. The mapper will collect the words and counts, and the reducer will combine them.
As discussed in Lin & Dyer, Chs. 2-3, there are a number of different approaches to implementing something like word count. The simplest is for the mapper to simply output the pair (word, 1) each time it encounters a word. The reducer will then total up the values for each key.
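A minimal sketch of this simple approach (assuming whitespace-tokenized input, one piece of text per line; the function names are just illustrative):

```python
import sys
from itertools import groupby

def map_line(line):
    """Emit a (word, 1) pair for every whitespace token."""
    for word in line.split():
        yield word, 1

def reduce_pairs(pairs):
    """Sum counts per word. `pairs` must arrive sorted by key,
    which Hadoop streaming guarantees for reducer input."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# In mapper.py, drive map_line over sys.stdin and print each pair
# as a tab-delimited line; reducer.py parses those lines back into
# (word, count) pairs and drives reduce_pairs the same way.
def run_mapper(stream=sys.stdin):
    for line in stream:
        for word, count in map_line(line):
            print(f'{word}\t{count}')
```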
Another way is to have the mapper compute the totals for each word over the file that it is processing, and then output those totals. These two approaches have different implications for computational resources, so you should try both and see what the differences are. In particular, compute the times for the simple versus the in-mapper aggregation versions over different-sized subsets of the Gigaword corpus, and plot the results, similar to what Lin & Dyer show for the pairs and stripes approaches in Figure 3.10.
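The in-mapper aggregation version might be sketched like this (same assumptions as before; the mapper holds a local tally and emits one pair per distinct word rather than one per token):

```python
import sys
from collections import Counter

def count_split(lines):
    """In-mapper combining: tally words locally over this input
    split and return one total per distinct word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# mapper.py would call this over sys.stdin.
def run_mapper(stream=sys.stdin):
    for word, total in count_split(stream).items():
        print(f'{word}\t{total}')
```

The trade-off, as Lin & Dyer discuss, is less intermediate data shuffled to the reducers in exchange for holding the tally in mapper memory.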
Once you have your mapper and reducer working (you'll probably want to test them locally on a single file beforehand), you can write a script such as the following to submit the job on one of the bigbird machines:
#!/bin/bash
set -x
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u6.jar \
    -mapper `pwd`/mapper.py \
    -reducer `pwd`/reducer.py \
    -input /data/eng_giga \
    -output /user/$YOURNAME/eng_giga_word_counts \
    -numReduceTasks 10
If you're really motivated, look in /user/steven/pubmed/data_json. You'll find the PubMed corpus; click here for a sample of the files' contents. Note that each line is extremely long! The first tab-separated field is the article's PMID; the second is a JSON-formatted serialization of a MEDLINE record. Its structure is complex but should be consistent across lines; here is a formatted example of what each record should look like.
Adapt your program to de-serialize the MEDLINE record and extract the article's abstract; note that not all articles have abstracts, so your code will have to handle this gracefully!
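The de-serialization step might look like the sketch below. Note that the key name 'AbstractText' is an assumption about the record layout, not something confirmed here; check the formatted example for the real structure.

```python
import json

def extract_abstract(line, field='AbstractText'):
    """Split a tab-separated (PMID, JSON) line and pull out the
    abstract. 'AbstractText' is a hypothetical key name; adjust it
    to match the real MEDLINE record structure. Returns a
    (pmid, abstract) pair, or None when the article has no abstract."""
    pmid, _, blob = line.rstrip('\n').partition('\t')
    record = json.loads(blob)
    abstract = record.get(field)  # absent for some articles
    if not abstract:
        return None
    return pmid, abstract
```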
Perform a word count of MEDLINE, and compare the top N words with the top N words from Gigaword. Are they similar? Different?
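One way to do the comparison (a sketch; the (word, count) pairs come from whatever your reducers wrote) is to take the N most frequent words from each output and look at the overlap:

```python
import heapq

def top_n(counts, n):
    """counts: iterable of (word, count) pairs, e.g. parsed from
    reducer output. Return the n most frequent words."""
    return [w for w, _ in heapq.nlargest(n, counts, key=lambda kv: kv[1])]

def shared(words_a, words_b):
    """Words common to both top-N lists."""
    return set(words_a) & set(words_b)
```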
Write a map-reduce that extracts instances of ambiguous abbreviations in Gigaword, in context. The specific abbreviation to extract should be specified somewhere obvious in your code. A good example of an ambiguous abbreviation is "Dr", which could be either "Doctor" or "Drive". Collecting such examples in context is useful for, e.g., providing training data for a classifier that can disambiguate the two.
Taking "Dr" as the example, your mapper should find instances of "Dr", and output "Dr" in a window of words of some reasonable width -- say 4 words to the left and 4 words to the right. In case the abbreviation occurs at the beginning of the text, you should pad the left context with some easily identifiable padding token (e.g. "---"); similarly for the right context.
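The padding and windowing described above can be sketched as follows (assuming the mapper has already split its input into tokens):

```python
def windows(tokens, target='Dr', width=4, pad='---'):
    """Yield a (2*width+1)-token window around each occurrence of
    the target abbreviation, padding either edge with the pad
    symbol when the context runs off the text."""
    padded = [pad] * width + tokens + [pad] * width
    for i in range(width, len(padded) - width):
        if padded[i] == target:
            yield padded[i - width:i + width + 1]
```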
It is also desirable to filter contexts so that one does not end up with too many examples of similar contexts. Ideally one would like to classify contexts by features that are suspected to be relevant for the ultimate classification task. In the case of "Doctor/Drive", features likely to be relevant are basic typographical ones, such as whether the word to the left starts with an upper-case letter, a lower-case letter, or a digit. One can then classify each token to the left and right according to whether it starts with an upper-case letter ("U"), a lower-case letter ("L"), a digit ("N"), or punctuation ("P"), or else is a padding symbol ("-"). Thus the signature "LLNU_PLUP" for a 4-word context means that the 4th word to the left is lower case, as is the 3rd; the 2nd word to the left is a number; the word immediately to the left is upper case; and so forth for the right context.
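The typographical signature might be computed as follows (a sketch; each token is classified by its first character):

```python
def char_class(token, pad='---'):
    """Classify a token as upper-case (U), lower-case (L),
    digit (N), punctuation/other (P), or padding (-)."""
    if token == pad:
        return '-'
    first = token[0]
    if first.isupper():
        return 'U'
    if first.islower():
        return 'L'
    if first.isdigit():
        return 'N'
    return 'P'

def signature(window, width=4):
    """Build the key, e.g. 'LLNU_PLLL', from a window of
    2*width+1 tokens centered on the abbreviation."""
    left = ''.join(char_class(t) for t in window[:width])
    right = ''.join(char_class(t) for t in window[width + 1:])
    return f'{left}_{right}'
```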
These context signatures can then become the keys, and the whole abbreviation-in-context strings the values.
The reducer should accumulate the key/value pairs, allowing no more than some number (say 100) of any given context. It should also remove duplicate abbreviation-in-context strings.
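That reducer logic might be sketched like this (assuming, as before, that the key/value pairs arrive sorted by key):

```python
from itertools import groupby

def cap_contexts(pairs, limit=100):
    """pairs: (signature, context) tuples sorted by signature.
    Emit at most `limit` distinct contexts per signature,
    dropping duplicate abbreviation-in-context strings."""
    for sig, group in groupby(pairs, key=lambda kv: kv[0]):
        seen = set()
        for _, context in group:
            if context in seen or len(seen) >= limit:
                continue
            seen.add(context)
            yield sig, context
```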
Output from this map reduce might look something like this:
LLLL_LUUP    no period after the Dr in Dr Pepper ?
LLLL_PLLL    agree more with what Dr . el-Baz told you
LLLL_PLLL    generous offer to compensate Dr . deNatale and repurchase
LLLL_PLLL    thing that mattered to Dr . al-Hazmi was to
LLLL_PLLU    an incomplete version of Dr . van den Haag
LLLL_PLLU    invariably referred to as Dr . except by The
LLLL_PLLU    the period after the Dr . on a Dr
LLLL_PLPL    about this from some Dr . who 'd tell
LLLL_PLUL    embodies the symbolic form Dr . von Franz called
LLLL_PLUL    government agreed to accept Dr . al Masari because
As one might expect, examples in contexts that include "NU_" (number followed by an upper-case word, then "Dr") tend to be "Drive":
LLNU_PLLL    lease at 115 Dorsett Dr . was signed by
LLNU_PLLU    home at 2576 National Dr . in the Mill
LLNU_PLUP    home at 601 Riverview Dr . in Florence .
LLNU_PPUL    home at 929 Windermere Dr . " They said
LLNU_PPUP    home at 1000 Windermere Dr . " Oh ,
LLNU_PULL    amiss at 1531 Elm Dr . But until last
LLNU_PULL    job at 305 Riverside Dr . A relative of
LLNU_PUUP    street from 13 Calle Dr . Santiago Veve .