Homework 6: Finite-State Approximate String Matching

Assigned: 12/10/2014

Due: 12/19/2014 (Firm!)

In this assignment, you will use OpenFST to implement a trie data structure, as well as a Levenshtein Automaton. You will use these data structures to build a small user interface for searching through a dictionary.

Part 1: Implementing a simple trie

Write a program to construct a trie, represented as an OpenFST transducer with letter symbols on the input side and word symbols on the output side. I recommend using PyFST, since it will make later parts of this assignment much easier- but you may use whatever method you wish.

If you installed OpenFST using Homebrew, you should be able to install PyFST with relatively little trouble.

Your trie implementation must be able to do the following things:

Initialize itself from a list of words or from a file;
Perform efficient prefix searching (i.e., quickly find all of the entries that begin with a prefix, returned as a list of strings)
Serialize the trie to disk so that other programs may use it.

Additionally, your program should have an "interactive mode", in which users can type in a prefix and see the list of matching words.

Question: a common usage scenario for prefix searching is "autocompletion" - a user is typing a word in a search box, and as they type, a list of possible matches appears. Each character that they type prunes that list. How might your code be modified to handle this scenario (a series of prefix searches, where each successive prefix is a direct continuation of the previous query) efficiently?

What to turn in:

Your code;
A sample terminal session, showing how to use it on /usr/share/dict/words, and your program's output;
A paragraph describing your approach, problems you had, ways you optimized your algorithm's performance, your thoughts on the prefix-searching question listed above, etc. etc.

Caution: PyFST has a quirk regarding symbol tables: it does not read or write the ASCII-formatted tables that we have all come to know and love; instead, it insists on reading its symbol tables in a binary format. To save you time and trouble, I have created two useful binary symbol tables:

ascii.syms.bin
words.syms.bin (containing an entry for each word in /usr/share/dict/words)

You can read these in to PyFST using the read_symbols() method. In both cases, the symbol for an ε is <epsilon>.

Note: You will find that PyFST is not very well documented. To save you some trouble, here is some example usage:

import fst

# make a new transducer:
my_transducer = fst.Transducer()

# read some symbol tables:
letter_syms = fst.read_symbols("ascci.syms.bin")

# manually set symbol tables
my_transducer.isyms = letter_syms
my_transducer.osyms = letter_syms

# looking things up in a symbol table:
letter_syms["m"] # 109
letter_syms.find(109) # "m"

s = "monkey"

# let's make an acceptor!
for idx, c in enumerate(s):
    my_transducer.add_arc(idx, idx + 1, c, c)

# subscripting with an int gets the state with that id
last_state = my_transducer[idx + 1] 
last_state.final = True

# some FST operations return a copy:
det_monkey = my_transducer.determinize()

# others happen in-place:
det_monkey.remove_epsilon()

# sort arcs on input labels (required for composition)
det_monkey.arc_sort_input()

# write our determinized monkeys to disk:
det_monkey.write("determinized_monkeys.fst")

states = StateCounter()
print states["my first state"] # 0
print states["another nifty state"] # 1
print states["my first state"] # 0

Part 2: Levenshtein Automata

Write a program to construct a simple, unweighted Levenshtein automaton given a word and an edit distance, and that uses the resulting automaton to search for similar words in the trie constructed in Part 1.

Note that since you'll be using your trie, you will need to make sure to use the same symbol table for your Levenshtein automaton that you did in part 1 of this assignment.

To implement insertions, you'll need to add <eps>:$char arcs for each letter of the alphabet; deletions and substitutions will follow a similar pattern.

You may do the searching in one of two ways:

Manually walk through the determinized Levenshtein automaton and trie, reporting any time both are in a final state;
Use FST algorithms (composition, etc.) and let OpenFST do the work for you.

Either way is fine for this assignment- it's up to you. Composition is an expensive operation, but makes for simpler code; implementing your own state machine traversal routine will result in a much faster algorithm, but could be a little tricky to get perfect.

An important consideration is whether to consider case changes (using "s" instead of "S", etc.) to be edits. How your program handles this is up to you, but you need to document which you chose!

As before, your program must support an interactive mode, in which users can type in a query string and an edit threshold, and the program will print out the list of similar words.

What to turn in:

Your code;
A sample terminal session, showing how to use your code as well as your program's output;
A short writeup describing your approach, especially about how you chose to do the searching. Did you let OpenFST do the walking, or did you live dangerously and do the FSA traversal yourself? How does your Levenshtein automaton handle case changes?

Note: I often find that OpenFST's behavior of referring to states only by their numerical ID to be very annoying; I would rather be able to look up states by a more meaningful identifier. To help with that, I wrote a tiny little utility class that you might find helpful: StateCounter. It is essentially a write-once defaultdict whose values are made up of a sequence of integers. The first time you ask it for a key, it will increment its counter and associate the new value with that key. So, the first key you ask it for will forever after return 0, the next 1, and so on. For example:

states = StateCounter()
print states["my first state"] # 0
print states["another nifty state"] # 1
print states["my first state"] # 0

I found this to a useful aid when manually constructing an FST.

Part 3: Extending Part 2

The automaton you built in Part 2 does a great job of producing a list of similar words. However, in a real application, we'd want to sort those somehow, for several reasons:

For some words, the list of similar words can be very long;
Two pairs of words that are the same distance apart in edit space can seem much further apart to humans from a perceptual standpoint ("leaf" and "leef" vs. "leaf" and "geaf");
If we're talking about spell-checking, some typos are much more common than others, which might affect the order in which we'd want to offer suggested corrections.

For the third part of the assignment, you'll modify your code from Part 2 to have a more nuanced picture of the relative importance of different errors. The best way to implement this is by assigning weights of some kind to our automaton's error arcs, and then using those weights to decide how to sort possible paths through the machine.

There are many possible weighting schemes, for example:

"i" and "e" get swapped a lot; when that happens, maybe it should probably "cost" less than when other letter pairs get swapped.
On a similar vein, inserting a vowel after another vowel might be less "bad" than inserting a consonant after another consonant.
Should phonetics count? Substituting a "k" for a "c" might be cheaper than substituting an "l" for a "c".
What about morphology? Adding "s" to the end of a word might be less of a problem than adding a "j".

Other ideas might involve working on the list of suggestions directly- should the list of suggestions be sorted according to unigram frequency within some corpus?

These are just a few ideas; I'm sure you can think of plenty more! Your task is to pick (at least) two approaches, and experiment with some test words.

What to turn in:

Your code;
A few paragraphs describing your approaches you chose, how well they worked, etc. Make sure to include example input words together with their ranked suggestions, and some discussion about how the different approaches that you implemented affected the rankings.

Back