Homework 2: Finite-State Transducers

Assigned: 10/11/2014

Due: 10/20/2014

In this assignment, you will be working with finite-state transducers and OpenFST. The main course page's "Resources" section lists several useful resources, such as the fstprintstrings program and so on, that you will need for the various parts of this assignment.

Unless otherwise specified, you may use whatever interface you like to OpenFST: the command-line OpenFST tools, the C++ API, pyfst, Thrax, etc.

Part 1: Number-name generator

You will build a finite-state transducer that takes a number between 0 and 99,999 and turn it into words. For example, it will take 123 and produce "one hundred twenty three". Note that you will need to create a new symbol table for the output, which will contain words like "one" and "hundred". For the input, just use the ascii.syms.txt symbol table.

For this assignment, the "default" is to use English-language number names, but if you would like to try a different language's number naming system, you should feel free to do so. For example, many South Asian languages use lakh to describe units of 100,000 ("three lakh fifty two thousand" for 352,000), in much the same way as English uses dozen for units of 12 ("three dozen" for 36). If it sounds like more fun to implement a Hindi number-naming system than it does to do an English one, go for it! Likewise for Basque, or whatever system strikes your fancy.

What to turn in:

  1. Your transducer itself, in ASCII format (i.e., the file you would feed to fstcompile), or, alternatively, your Thrax .grm file (if that's what you used).
  2. Your output symbol table (if applicable)
  3. Some sample terminal output, showing it at work
  4. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs that you ran into, as well as how you got around them

Hint: Start with small numbers, and work your way up.

Hint: I suggest starting with raw OpenFST for 0-999, and then moving to Thrax for the rest (it can get quite tedious otherwise).

Tip: fstprintstrings takes the same --isymbols= and --osymbols arguments as any other OpenFST program.

Tip: The Russian number-naming system is extraordinarily complex, so maybe don't choose that one? (Unless you really want to, and know enough Russian to have it not be horribly painful).

Part 2: ROT-13

ROT-13 is an example of a Caesar cipher, a category of historically-important (but functionally useless) ciphers. ROT-13 has the interesting property that it is its own inverse— i.e., ROT13(ROT13(x))=x.

Create an FST that will encipher/decipher ROT-13-encoded messages. You need only worry about the letters a-z, in both upper-case and lower-case, as well as standard ASCII punctuation (which should pass through ROT-13 unchanged).

This should be pretty easy; so, let's make things more interesting. The traditional way to break a substitution cipher is to use character frequency counts (i.e., if the letter "j" appears most often in the ciphertext, you might guess that it means "e", and so on). In an attempt to foil this, our adversary scrambles their message's vowels before enciphering with ROT-13. They do this by choosing, for each vowel, a different vowel at random. So, for example, the word "example" might be transformed into "ixumpla".

You intercept the following message:

Ule zhguve nayl fpryqbq ube she ovrat ahafvafnpry.

(The trailing period is part of the message). You know, a priori, that this is a sentence from Pride and Prejudice that has been scrambled using the above scheme (random vowel substitution followed by ROT-13). Using Thrax and OpenFST, construct a set of transducers that will recover the original sentence. You will need pride_and_prejudice_sentences.fst, which is an acceptor that will recognize any sentence from the Project Gutenberg version of Pride and Prejudice.

What to turn in:

  1. Your simple ROT-13 FST (in either OpenFST ASCII format or as a Thrax .grm file);
  2. The decoded sentence;
  3. A paragraph describing how you solved the problem, along with a sentence or two about anything you found particularly interesting, difficult, surprising, etc. about the process; and
  4. Your decryption FSTs (or Thrax files) as well as sample Terminal output showing them in use.

Part 3: T9 Text Entry

Recall that non-smartphones use a predictive text-entry system often called "t9", in which the user enters numbers on the keypad, and the phone "guesses" what words they were trying to type based on the telephone keypad code ("2" being either "a", "b", or "c", etc.). (Note: "t9" actually refers to a specific implementation of this general approach, but the name has become more generic over time). You will construct an FST that takes a sequence of digits (e.g. "228") and produces a lattice of letters; when you compose this lattice with the wordlist fst (which can be found in the spelling correction demonstration from class, or, alternatively, trivially recreated using Thrax, StringFile, and /usr/share/dict/words), you will end up with a lattice of possible English-language words those digits could represent.

Note: Your FST does not need to handle spaces. This is a single-word recognizer.

What to turn in:

  1. Your t9 (digit-to-letter) FST, in ASCII format (or as a Thrax .grm file, if that's what you used)
  2. Sample terminal output, showing it at work with several sample inputs (i.e., using fstcompilestring to enter digit sequences, and fstprintstrings to print output)
  3. A paragraph describing your approach, including anything interesting or unexpected that happened along the way; please include a sentence describing "next steps", i.e., how might you extend your transducer to make it a more accurate predictor of words?