CS/EE 555/655: Analyzing Sequences

Fall 2014, Mondays & Wednesdays at 4:00


Synopsis

Many types of data occur sequentially. A conversation between two people can be thought of as a sequence of speech turns; a written sentence can be thought of as a sequence of words and punctuation symbols, while a spoken sentence can be thought of as a sequence of phonemes. Time-series data (weather observations, stock tickers, etc.) is inherently sequential, and, of course, molecular biology is replete with examples of sequential data (sequences of nucleotide bases in DNA and RNA, amino acids in protein strands, etc.).

This course will concern itself with methods for analyzing these kinds of data, with a particular emphasis on sequences derived from linguistic data. Topics will include:

While we will be primarily focused on written language, we may dip our toes into speech here and there.


Fig. 1: xkcd #851, "Na", by Randall Munroe. An example of an FSA describing the surprisingly regular language of several notable songs.

Prerequisites

The course will include several hands-on assignments in which students will implement various algorithms and run them on real-world data. As such, students must be very comfortable with basic programming (in the language of their choice). What does this mean? It means that any graduate student in Computer Science or Electrical Engineering should have the appropriate background for this course. For students from other backgrounds, consider the following description of a hypothetical programming assignment:

"Write a program to traverse a directory hierarchy containing the various plays of Shakespeare, and generate a report that lists which words are most over-represented in the tragedies as compared to the comedies."

Can you do this? If so, that means that you know about file I/O and some basic data structures (lists, dictionaries/hashes, etc.). You are probably good to go for the course, as far as the programming piece is concerned.

Another important pre-requisite is the ability to work comfortably with a UNIX-style command line. You don't need to know about every command ever written, but you should be at least a little bit familiar with how to navigate a directory hierarchy, create, copy, and move files, run programs, and pipe IO from one program to another.

In addition to being comfortable with this the above, students should also be comfortable working with probabilities, statistics, and algebra. Specific concepts to know about include: joint and conditional probabilities, Bayes' Theorem, random variables drawn from specific probability distributions (e.g. the Gaussian distribution), logarithms and log-probabilities, etc. Some past exposure to linear algebra will also be extremely useful for certain parts of the course.


Instructor

Fig. 2: The instructor, next to a large sculpture of the majestic Crotaphytus bicintores.

CS/EE 5/655 is being taught by Steven Bedrick. He can usually be found in his natural habitat, Gaines Hall room 19. While he has no set office hours, GH is far enough off the beaten path that you should probably schedule something with him before making the schlep.

We strongly encourage you to consult the Student Health Center for guidance about any pre-travel immunizations that may be required before visiting Gaines Hall.


Textbook

We will be using Jurafsky & Martin's "Speech and Language Processing", 2nd edition as our primary text in this course, but we will also probably use readings from other sources as well. New copies of J&M are absurdly expensive, but well worth the investment, as you will find this book necessary and useful for future CSLU classes. You may be able to find used copies online at a significant discount.

Note that many of the electronic copies of J&M floating around the Internet, besides being of questionable provenance, are of the first edition— which contains numerous errata and less material than does the second edition.


Schedule

Date Topic Reading HW Assigned
Sep 29 (M) Introduction    
Oct 1 (W) Regular Expressions J&M Chap. 2 HW1 - Regexes
Oct 6 (M) FSTs, part 1    
Oct 8 (W) FSTs, part 2 [in-class FST examples] Sproat & Roark Chapter 1. HW2 - Number Name generation, ROT-13, & T9
Oct 13 (M) LMs    
Oct 15 (W) LMs & Smoothing    
Oct 20 (M) String Similarity    
  Edit distance - Levenshtein - Jaccard - Jaro-Winkler - Wagner-Fischer Algorithm - Back-door way to introduce Dynamic Programming;
Oct 22 (W) Discrete HMM    
Oct 27 (M) Viterbi & Baum-Welch J&M 5.5, 6.1-6.4 HW3 - Implement simple Viterbi tagger
Oct 29 (W) VQ & GMMs    
Nov 3 (M) EM algorithm    
Nov 5 (W) GMM HMMs    
Nov 10 (M) Speaker Diarization Case Study   HW4 - Diarization
Nov 12 (W) String Matching - Literal (1) [supplementary slides: Naïve algorithm, Improved naïve algorith, Linear-time "Z" algorithm] Gusfield Chapter1, Add'l Resources  
Nov 17 (M) String Matching - Literal (2) Gusfield 1 & 2  
  [supplementary slides: roark_kmp.pdf roark_boyer_moore.pdf]
Nov 19 (W) String Matching - Approximate [Aho-Corasick, Baeza-Yates & Perleberg] Gusfield Chapters 3.4.. & 12  
Nov 24 (M) String Alignment    
Nov 26 (W) Suffix Trees & Arrays    
Dec 1 (M) More Fun and games with suffix trees   HW5 - Boyer-Moore (& KMP)
Dec 3 (W) Tries and Levenshtein Automata Add'l Reading  
Dec 8 (M) Joel: Information Theory & Cross-entropy; Shiran: Coding & Compression    
Dec 10 (W) HW5 presentations!   HW6 - Tries & Levenshtein Automata

Logistics

CS655 will be held Mondays and Wednesdays, from 4:00 to 5:30 PM, in GH5.


Homework

We will have homework, (basically) all of which will involve programming. The point of the homework is to give you "hands on" experience with the algorithms we'll be covering, not to learn how to write production-ready code. For some of the assignments, I will provide "scaffolding" code that may save you significant time; mostly, this code will be written in Python. If you want to use something else to do the assignment, you are of course free to do so.

Some assignments will, however, make use of Python libraries to do some of the heavy lifting. When this is the case, the assignment description will include notes on which libraries are needed and how to install them. Please make your life easier by installing those libraries at the very earliest opportunity, since some of them are known to be a bit finicky to get running.

The homework assignments will all come with a "due date." Assignments will be due at 11:59 pm (Portland time) on their due date. If you think you will need additional time to complete an assignment, let me know as soon as possible. If something serious and unexpected comes up at the last minute (illness, family emergency, etc.), we'll work something out.


Grading

There will be no final project, nor will there be a final exam. This means that most of your grade (70%) in the course will be from the homework assignments. The remaining 30% will be based on class participation. What does this mean? It means attending class, asking questions during lectures, coming to talk to me about homework assignments, and so on.

Resources

Useful books

Websites of note


Student Access Statement

Our program is committed to all students achieving their potential. If you have a disability or think you may have a disability (physical, learning, hearing, vision, psychological) which may need a reasonable accommodation please contact Student Access at (503) 494-0082 or e-mail studentaccess@ohsu.edu to discuss your needs. You can also find more information at www.ohsu.edu/student-access. Because accommodations can take time to implement, it is important to have this discussion as soon as possible. All information regarding a student’s disability is kept in accordance with relevant state and federal laws.