CS 5/662, Winter 2021
In this assignment, you will be working with syntax trees. The three parts to the assignment are:
To save you headache, I have provided [a Python file]((/bedricks/courses/cs662_winter_2021/resources/hw6_files/tree.py) containing code that will read serialized parse trees from PTB .psd
file into a Python utility class. This class has several methods to help with tree traversal, and also is capable of performing certain tree transforms as discussed in class (specifically, collapsing unary productions and converting a tree into Chomsky Normal Form).
Before you start the assignment, I strongly encourage you to take a few minutes to look over the code in this file and familiarize yourself with the Tree
class, and to play around with it a little bit in the terminal (or a Jupyter notebook) to get a feel for how it works. Most of the various methods include docstrings with unit tests that demonstrate how it is meant to be used; pay particular attention to the “magic” methods (__iter__
, __getitem__
, etc.). If you are not sure how these should be used, ask for help. In the past, students have wasted an enormous amount of time and effort re-inventing functionality that was already present in tree.py
.
Relevant files for this assignment are:
tree.py
wsj-normalized.psd.gz
: A sizeable subset of the WSJ section of the Penn Treebank, with certain normalizations applied (replacing proper names with NNP
class labels, etc.) to reduce its complexity.bigger_treebank_2.txt.gz
: A small subset of the WSJ, used in part 3end_of_wsj.txt.gz
: A handful of sample example sentences, for use in part 3Your final assignment is not a tree transform per se. Rather, you are to to implement the productions()
instance method on the Tree
object from tree.py
. This will be a function which generates all productions (i.e., context-free rules) of a tree in Chomsky normal form. For instance, the tree
(TOP
(S
(VP
(TO to)
(VP
(VB play)
)
)
)
)
contains the following productions:
TOP -> S
S -> VP
VP -> TO VP
VP -> VB
TO -> "to"
VB -> "play"
The docstring shows the expected behavior of the function; note that there is a Production
namedtuple on line 42 that you are to use for the output.
What to turn in:
Use your now-completed tree.py
to write a program that will read in a treebank and induce an unsmoothed, maximum-likelihood-estimated PCFG. For example, if given the following input, consisting of a single tree:
(TOP
(NP
(DT the)
(NN teacher)
)
(VP
(MD will)
(VP
(VB lecture)
(NP
(NN today)
(PP
(IN in)
(NP
(DT the)
(NN lecture)
(NN hall)
)
)
)
)
)
(. .)
)
Your script should induce the following grammar:
TOP -> NP VP . 1.0
NP -> DT NN 0.3333333333333333
NP -> NN PP 0.3333333333333333
NP -> DT NN NN 0.3333333333333333
DT -> the 1.0
NN -> teacher 0.25
NN -> today 0.25
NN -> lecture 0.25
NN -> hall 0.25
VP -> MD VP 0.5
VP -> VB NP 0.5
MD -> will 1.0
VB -> lecture 1.0
PP -> IN NP 1.0
IN -> in 1.0
. -> . 1.0
NOTE that this particular sample tree (and thus its resulting induced PCFG) is not in CNF- it’s just an example! You’ll want to try out your algorithm on a few different input examples, and verify that it produces a CNF-valid grammar if given a CNF-valid tree. And, of course, the actual input will consist of more than one tree, so you’ll need to follow a similar pattern as you did in the first assignment when computing n-gram statistics (accumulate counts, convert to MLE probabilities, possibly working in log space).
You will be needing to read in and use your induced grammar in part 3, so make sure your program can save its grammar to disk (I would personally just use pickling).
What to turn in:
wsj-normalized.psd.gz
.Tip: Start thinking about what kind of data structure you want to use to store the production rules and their probabilities. You’ll be using this for implementing CYK, so keep in mind the various kinds of “questions” that the CYK algorithm needs to be able to “ask” of its grammar.
Tip: Tuples can be keys in Python dictionaries
Tip: Remember from the first assignment that collections.defaultdict
is your friend!
Tip: Remember that Python comes with a standard library module that will let you read in gzipped output; you don’t need to unzip it!
Implement the CYK algorithm both for grammar recognition as well as parse recovery, using the PCFG-induction code you wrote for part 2 of this assignment (but on a different, smaller dataset). The relevant chapters of J&M, as well as the lecture slides, will be your friends. You may also find this online CYK demo useful.
More concretely, you are to write a program that takes as its arguments a PCFG (induced from a tree or set of trees, using your code from part 2) and a sentence, and returns that sentence’s best parse tree as per the CYK algorithm. If the sentence cannot be parsed, your program should indicate this by returning some sort of error condition— None
, False
, a non-zero return code, an Exception
, etc.— it shouldn’t just crash.
What to turn in:
bigger_treebank_2.txt.gz
, and try running the trees from end_of_wsj.txt.gz
through your CYK implementation against this grammar. Many of the trees will not parse; some will. You will turn in a file telling us which parsed and which didn’t. Each line of the file will correspond to a tree from end_of_wsj.txt.gz
, and will contain a 0
if that tree parsed or a 1
if it did not.Tip: Watch out for off-by-one errors!
Tip: Make sure you pre-process your trees by removing non-pre-terminal unary productions and converting to Chomsky Normal Form!
Tip: The best way to begin is to start small. Rather than building a grammar out of the entire WSJ treebank and starting with a long and complex sentence, try instead to induce a grammar from a single short, simple sentence, and then try and run your CYK implementation on that. Having a small sentence that you can parse by hand and whose rules are very simple will make it much easier to debug the various off-by-one errors, etc. that will inevitably crop up, and a small and simple PCFG will make it easier to debug the math.
Tip: When I was developing this assignment, I found it to be very useful to have a function that could print out the current state of my CYK chart in a human-readable format. I actually wrote a function that made an HTML table containing my chart state, and when my algorithm was misbehaving, being able to “see” what it was doing made it much easier to figure out what was going wrong.
Tip: I’ll say it again: start small, with a single input sentence and its corresponding set of productions.