Assigned: 10/27/2014
Due: 11/10/2014
In this assignment, you will build and evaluate a simple HMM-based part-of-speech tagger. There will be three parts to this assignment: implementing Viterbi decoding, computing transmission/emission probabilities using MLE and Witten-Bell smoothing, and evaluting your taggers.
Implement the Viterbi decoding algorithm using the description given in Chapter 5 of J&M, and the transition/emission tables given in the toy example. You'll be using the code you write here for the next part of the assignment, so try and write it in a generalizable manner.
To save you some typing and help you get started started, we've provided a skeleton viterbi.py
, including the transition and emission probabilities from Chapter 5. You don't need to use it, but if you want to, it's there.
What to turn in:
PPSS
, VB
, TO
, VB
, as described in J&M. p. 149).Tip: Don't forget about underflow! Use the
Tip: Watch out for off-by-one bugs as you translate the J&M formulae; also note that they assume 1-based indexing, rather than the 0-based indexing used by most modern programming languages
Most of the time, you won't be given pre-calculated transition and emission probabilities and will have to calculate them for yourself. We've prepared a chunk of pre-tagged Wall Street Journal corpus for you to work with: wsj.pos.gz
and wsj-normalized.pos.gz
. What's the difference between the two? wsj-normalized.pos.gz
has had certain tags "collapsed"—for example, any token that was tagged with CD
(number) has been replaced with the type-token <CD>
; ditto for NNP
and NNPS
(proper nouns, singular and plural respectively). All three tags are very common in the WSJ corpus, and collapsing them like this helps minimize sparsity; we've provided both versions so you can see what a difference it makes!
Your assignment, then, will be to compute new transition and emission matrices based on the normalized WSJ corpus. You'll be performing an evaluation of your tagger's performance in part 3 of the assignment, so it will be important to split your corpus into training and test sets. For this assignment, you will use a 90/10 split, meaning that 90% of the corpus will be for training and 10% will be for testing.
You will compute the transition and emission probabilities in two ways:
To save time, we are providing a simple implementation of both an MLE as well as a W-B-smoothed language model: ngrammodel.tgz
. This file contains several Python files, including ngrammodel.py
— look in this file for examples of how to use the modelsing classes. You will use these files as a starting point for computing transition and emission probabilities from the WSJ corpus- be forewarned, you will need to modify the code, as in its current form it expects different input.
When using MLE, you will also need a way to handle out-of-vocabulary items (OOVs); the simplest (though not particularly accurate) thing to do is to assume that all OOVs are generated by the same tag. For this assignment, let that tag be the most frequent one (NN
, in the Penn Treebank).
Once you've created the transition and emission probability matrices using the training data (90% of the corpus), use the Viterbi function from part 1 to tag the training data (the remaining 10%).
Steps to complete this part of the assignment:
What to turn in:
Tip: Randomly "shuffle" the sentences before you split them into test and training sets
Tip: Long sentences may take a while to decode (perhaps even a few seconds), but that doesn't mean you did anything wrong
Now that you've got two different training approaches, it's time to see which one works better.
What to turn in: