Homework 3: Language Modeling and Viterbi Decoding

Assigned: 10/27/2014

Due: 11/10/2014

In this assignment, you will build and evaluate a simple HMM-based part-of-speech tagger. There will be three parts to this assignment: implementing Viterbi decoding, computing transition/emission probabilities using MLE and Witten-Bell smoothing, and evaluating your taggers.

Part 1: Implementing the Viterbi algorithm

Implement the Viterbi decoding algorithm using the description given in Chapter 5 of J&M, and the transition/emission tables given in the toy example. You'll be using the code you write here for the next part of the assignment, so try to write it in a generalizable manner.

To save you some typing and help you get started, we've provided a skeleton viterbi.py, including the transition and emission probabilities from Chapter 5. You don't need to use it, but if you want to, it's there.

What to turn in:

  1. Your code
  2. Some sample terminal output, showing it producing the correct result (PPSS, VB, TO, VB, as described in J&M, p. 149)
  3. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs you ran into as well as how you got around them

Tip: Don't forget about underflow! Use the bitweight.py Python class to help.

Tip: Watch out for off-by-one bugs as you translate the J&M formulae; also note that they assume 1-based indexing, rather than the 0-based indexing used by most modern programming languages
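One common alternative to bitweight.py is to do all the arithmetic in log space, where products become sums and underflow disappears. The sketch below shows the general shape such an implementation might take; the two-state tables in the test are made-up toy numbers for illustration, not the J&M tables:

```python
import math

def logp(p):
    """Log probability; zero probability maps to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding in log space to avoid floating-point underflow.

    start_p[s], trans_p[prev][s], and emit_p[s][word] are ordinary
    probabilities; missing dictionary entries are treated as zero.
    """
    # trellis[t][s] = (best log score of a path ending in s at time t, backpointer)
    trellis = [{s: (logp(start_p.get(s, 0)) + logp(emit_p[s].get(obs[0], 0)), None)
                for s in states}]
    for t in range(1, len(obs)):
        column = {}
        for s in states:
            best = max(states,
                       key=lambda prev: trellis[t - 1][prev][0]
                       + logp(trans_p[prev].get(s, 0)))
            score = (trellis[t - 1][best][0]
                     + logp(trans_p[best].get(s, 0))
                     + logp(emit_p[s].get(obs[t], 0)))
            column[s] = (score, best)
        trellis.append(column)
    # Backtrace from the best final state.
    state = max(states, key=lambda s: trellis[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = trellis[t][state][1]
        path.append(state)
    return path[::-1]
```

Note that the trellis here is 0-indexed throughout, which sidesteps the off-by-one issues that come from translating the 1-based J&M formulae directly.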

Part 2: Calculating real-world probabilities

Most of the time, you won't be given pre-calculated transition and emission probabilities and will have to calculate them for yourself. We've prepared a chunk of pre-tagged Wall Street Journal corpus for you to work with: wsj.pos.gz and wsj-normalized.pos.gz. What's the difference between the two? wsj-normalized.pos.gz has had certain tags "collapsed"—for example, any token that was tagged with CD (number) has been replaced with the type-token <CD>; ditto for NNP and NNPS (proper nouns, singular and plural respectively). All three tags are very common in the WSJ corpus, and collapsing them like this helps minimize sparsity; we've provided both versions so you can see what a difference it makes!

Your assignment, then, will be to compute new transition and emission matrices based on the normalized WSJ corpus. You'll be performing an evaluation of your tagger's performance in part 3 of the assignment, so it will be important to split your corpus into training and test sets. For this assignment, you will use a 90/10 split, meaning that 90% of the corpus will be for training and 10% will be for testing.

You will compute the transition and emission probabilities in two ways:

  1. Using maximum-likelihood estimation (MLE)
  2. Using Witten-Bell smoothing
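The provided ngrammodel.tgz implements Witten-Bell for you, but for intuition, the interpolated form of Witten-Bell for a single conditioning history can be sketched as follows. The uniform backoff distribution here is an illustrative simplification; the provided code may back off differently:

```python
from collections import Counter

def witten_bell(counts, vocab_size):
    """Interpolated Witten-Bell estimate for one conditioning history.

    counts: Counter of events observed after that history.
    Backs off to a uniform distribution over vocab_size events.
    """
    n = sum(counts.values())   # tokens seen after this history
    t = len(counts)            # distinct types seen after this history
    lam = n / (n + t)          # confidence in the MLE estimate
    def prob(event):
        mle = counts[event] / n if n else 0.0
        return lam * mle + (1 - lam) / vocab_size
    return prob
```

The key idea: the number of distinct types already seen after a history estimates how likely a *new* type is to follow it, so histories with many one-off continuations reserve more mass for unseen events.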

To save time, we are providing a simple implementation of both an MLE as well as a W-B-smoothed language model: ngrammodel.tgz. This file contains several Python files, including ngrammodel.py; look in this file for examples of how to use the model classes. You will use these files as a starting point for computing transition and emission probabilities from the WSJ corpus. Be forewarned: you will need to modify the code, as in its current form it expects different input.

When using MLE, you will also need a way to handle out-of-vocabulary items (OOVs); the simplest (though not particularly accurate) thing to do is to assume that all OOVs are generated by the same tag. For this assignment, let that tag be the most frequent one (NN, in the Penn Treebank).
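Putting the two pieces above together, the MLE tables reduce to normalized counts, with a separate lookup that routes OOVs to NN. The `<s>` boundary pseudo-tag, the helper names, and the small OOV probability below are all illustrative choices, not requirements:

```python
from collections import Counter, defaultdict

def normalize(counts):
    """Turn a Counter into a probability distribution (MLE)."""
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def mle_tables(tagged_sents):
    """MLE transition/emission estimates from [(word, tag), ...] sentences."""
    trans_counts = defaultdict(Counter)  # trans_counts[prev_tag][tag]
    emit_counts = defaultdict(Counter)   # emit_counts[tag][word]
    for sent in tagged_sents:
        prev = "<s>"                     # sentence-boundary pseudo-tag
        for word, tag in sent:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
    trans = {p: normalize(c) for p, c in trans_counts.items()}
    emit = {t: normalize(c) for t, c in emit_counts.items()}
    return trans, emit

def emission(emit, tag, word, oov_tag="NN", oov_prob=1e-6):
    """Emission lookup that assumes every OOV word was generated by oov_tag."""
    known = any(word in dist for dist in emit.values())
    if known:
        return emit[tag].get(word, 0.0)
    return oov_prob if tag == oov_tag else 0.0
```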

Once you've created the transition and emission probability matrices using the training data (90% of the corpus), use the Viterbi function from part 1 to tag the test data (the remaining 10%).

Steps to complete this part of the assignment:

  1. Split the corpus into training/test partitions
  2. Write code to calculate MLE transition/emission probabilities
  3. Modify your Viterbi implementation from Part 1 to use the MLE probabilities instead of the hard-coded probabilities provided in Chapter 5 of J&M; don't forget about handling OOVs!
  4. Modify the provided W-B implementation to work with your Viterbi implementation

What to turn in:

  1. Your code to calculate MLE transition/emission probabilities, your modified Viterbi/MLE implementation
  2. Your modified W-B and Viterbi implementation
  3. Some sample terminal output showing both implementations tagging sentences
  4. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs you ran into as well as how you got around them

Tip: Randomly "shuffle" the sentences before you split them into test and training sets

Tip: Long sentences may take a while to decode (perhaps even a few seconds), but that doesn't mean you did anything wrong

Part 3: Evaluation

Now that you've got two different training approaches, it's time to see which one works better.

  1. Run your MLE and W-B implementations over the 10% of the corpus that you set aside for testing, and save the output somewhere
  2. For each implementation, compute classification accuracy of your tagger, using the tags in the original corpus as the "gold standard"
  3. Perform an error analysis: figure out which tag categories your tagger got wrong most often (e.g., verbs), and which specific tag-tag confusions (i.e., mistakes) it made the most (e.g., tagging something as a noun when it should have been a verb); a confusion matrix is the natural tool for this part
  4. Also look at some commonly mis-tagged words; do you see any patterns?
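Steps 2 and 3 can share one pass over the data: a confusion counter keyed by (gold, predicted) tag pairs gives you accuracy (the diagonal) and the most frequent confusions (the off-diagonal). A minimal sketch, with illustrative names:

```python
from collections import Counter

def evaluate(gold_sents, predicted_sents):
    """Token-level accuracy plus a (gold, predicted) confusion counter."""
    confusion = Counter()
    correct = total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for g, p in zip(gold, pred):
            confusion[(g, p)] += 1
            total += 1
            correct += (g == p)
    return correct / total, confusion
```

`confusion.most_common()` filtered to pairs where the two tags differ then lists your tagger's worst confusions directly.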

What to turn in:

  1. A couple of paragraphs, including the accuracy of MLE and W-B taggers and your error analysis findings; also, feel free to discuss what was hard, easy, surprising, boring, etc.