Many real-world computational problems involve data sets that are too large to process on a single computer, or that have other characteristics (fault tolerance, etc.) that require multiple computers working together. Examples include analysis of high-throughput genomic or proteomic data, data analytics over very large data sets, large-scale social network analysis, training machine learning models on "web scale" data sets, and so forth.
In this course, we will explore a variety of approaches to solving these kinds of problems through a mixture of lectures and student-led discussions of the research literature in the field. We will also hear from several guest lecturers with practical experience applying these kinds of algorithms in both academia and industry.
In addition to reading and discussing articles, students will become familiar with the Hadoop map-reduce environment as well as several other such systems through class assignments. There will also be a final project on the subject of the student's choice.
Fig. 1: Leafcutter ants (Atta colombica) are endemic to South and Central America. Working together in very large numbers, they collect, harvest, and process leaves to use as the substrate for Lepiotaceae fungi, which they actively cultivate. This fungus forms the ants' primary food source, and the ants expend a great deal of effort to protect their fungal colony from pests and molds.
By breaking large leaves into smaller leaves, and by clever specialization and work planning, very small ants are able to farm very large amounts of fungus.
By the end of the course, students will:
A graduate-level course on machine learning or probability and statistics. Students should be comfortable coding in at least one programming language, and will find the course much easier if they are familiar with the UNIX command-line environment. Students should also be comfortable reading scientific articles from the computer science literature.
CS/EE 5/655 is being taught by Steven Bedrick. He can occasionally be found in his natural habitat, Gaines Hall room 19. He has no set office hours, and GH is far enough off the beaten path that you should probably schedule something with him before making the schlep.
We strongly encourage you to consult the Student Health Center for guidance about any pre-travel immunizations that may be required before visiting Gaines Hall.
The course has no textbook; however, students may find some of the books listed below useful. Most of the assigned readings are available via the OHSU Library; we will arrange for access to the remainder.
Date | Topic | HW Assigned | HW Due |
---|---|---|---|
Mon Mar 28 | Course overview, our infrastructure | ||
Wed Mar 30 | Hadoop & Map/Reduce basics; Hadoop APIs | HW1 | |
Mon Apr 04 | Applications of MR: Inverted Indexing, Machine Translation | ||
Wed Apr 06 | Spark, etc. | Project Proposals | HW1 |
Mon Apr 11 | No class - Steven in Maryland | ||
Wed Apr 13 | Condor, Clouds, Workflow | ||
Mon Apr 18 | Analytical Tools (Pig, etc.) | Project Proposals | |
Wed Apr 20 | Neural Networks (Guest Speaker: Izhak Shafran, Google) | HW2 | |
Mon Apr 25 | Graphs 1 | ||
Wed Apr 27 | Bioinformatics Applications (Guest Speaker: Myron Peto, OHSU) | ||
Mon May 02 | Graphs 2 | Pilot Study Writeup | |
Wed May 04 | Pilot Study Presentations | HW2 | |
Mon May 09 | Distributed Math & ML | HW3 | |
Wed May 11 | Language Models (Guest Speaker: Brian Roark, Google) | ||
Mon May 16 | Collaborative Filtering | HW4 | HW3 |
Wed May 18 | Filesystems & Databases | ||
Mon May 23 | MPI | ||
Wed May 25 | Security | HW5 | HW4 |
Mon May 30 | Memorial Day | ||
Wed Jun 01 | Luigi (Joel Adams) and Consensus Algorithms | ||
Mon Jun 06 | Project presentations 1 | HW5 | |
Wed Jun 08 | Project presentations 2 | ||
Mon Jun 13 | "Finals week" | ||
Wed Jun 15 | "Finals week" | | Project writeup due |
Date | Topic | Presenter |
---|---|---|
Mar 28 | Introduction & CSLU resources | Steven |
Assigned readings:
Additional readings of interest:
Date | Topic | Presenter |
---|---|---|
Mar 30 | Hadoop & Map/Reduce Basics; Hadoop APIs | Steven |
Assigned readings:
Useful references:
Date | Topic | Presenter |
---|---|---|
Apr 04 | Applications of MR: Inverted Indexing, Machine Translation | Steven |
Assigned Readings:
Suggested Reading:
Date | Topic | Presenter |
---|---|---|
Apr 06 | Spark | Steven |
Assigned Readings:
Date | Topic | Presenter |
---|---|---|
Apr 11 | No class - SDB in Bethesda | N/A |
Date | Topic | Presenter |
---|---|---|
Apr 13 | HTCondor, Workflows, and Clouds | Steven |
Assigned Readings:
Recommended Readings
Date | Topic | Presenter |
---|---|---|
Apr 18 | Analytical Tools (Pig, etc.) | Neelay |
Assigned Readings:
Date | Topic | Presenter |
---|---|---|
Apr 20 | Neural Networks | Izhak Shafran, Google |
Slides:
Assigned Readings:
Strongly Recommended Readings:
Date | Topic | Presenter |
---|---|---|
Apr 25 | Graphs, part 1 | Anders |
Assigned Readings:
Strongly Recommended Readings:
Date | Topic | Presenter |
---|---|---|
Apr 27 | Bioinformatics Applications | Myron Peto, OHSU |
Assigned Readings:
Date | Topic | Presenter |
---|---|---|
May 2 | Graphs, part 2 | Ogi |
Assigned Readings:
Date | Topic | Presenter |
---|---|---|
May 4 | Pilot Study Presentations | Everybody! |
Date | Topic | Presenter |
---|---|---|
May 9 | Distributed Math & ML | Soe |
Assigned Readings:
Date | Topic | Presenter |
---|---|---|
May 11 | Distributed Language Modeling | Brian Roark, Google |
Assigned Readings:
TBD
Date | Topic | Presenter |
---|---|---|
May 16 | Collaborative Filtering & LDA | TBD |
Also Happening Today:
Assigned Readings:
Suggested:
Date | Topic | Presenter |
---|---|---|
May 18 | File Systems & Databases | TBD |
Assigned Readings:
Strongly Recommended:
Date | Topic | Presenter |
---|---|---|
May 23 | MPI | Rosemary |
Assigned Readings:
Useful References:
Date | Topic | Presenter |
---|---|---|
May 25 | Security | Casey |
Assigned Readings:
Suggested Readings:
Date | Topic | Presenter |
---|---|---|
June 1 | Workflow with Luigi | Guest: Joel |
Suggested Readings:
CS5/624 will be held Mondays and Wednesdays, from 4:00 to 5:30 PM, in GH5.
For this class, we will be using the CSLU "Bigbird" cluster. Ethan Van Matre is the system administrator for this cluster, and will be setting up user accounts for all students; email both him and Steven if you have any problems with your account. Please include "CS624" in the subject line of your emails, if possible.
A subset of the bigbirds has been set aside for use by this class and loaded with current versions of Spark and Hadoop. The "head node" for this sub-cluster is bigbird61.cslu.ohsu.edu; Hadoop and other cluster jobs should be run from there. You can access the administrative console here.
To connect to the cluster from on-campus, simply ssh to {your_user_name}@bigbird61.cslu.ohsu.edu. From off-campus, you'll want to connect to our gateway machine by ssh-ing to {your_user_name}@cslu.ohsu.edu, and from there connecting to bigbird61. I strongly recommend setting up an RSA keypair and using it for SSH authentication; see below for links to some useful SSH-related resources.
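For the off-campus case, the two-hop connection described above can be collapsed into a single command with an entry in `~/.ssh/config`. The following is only a sketch, not the cluster's official setup: the aliases `cslu-gateway` and `bigbird61-remote` are made-up names, `your_user_name` is a placeholder for your own account, and `ProxyJump` assumes an OpenSSH client of version 7.3 or newer.

```
# Hypothetical alias for the gateway machine named above
Host cslu-gateway
    HostName cslu.ohsu.edu
    User your_user_name

# Hypothetical alias for reaching the head node from off-campus;
# ProxyJump routes the connection through the gateway first
Host bigbird61-remote
    HostName bigbird61.cslu.ohsu.edu
    User your_user_name
    ProxyJump cslu-gateway
```

With an entry like this, `ssh bigbird61-remote` should hop through the gateway automatically. On older clients, `ProxyCommand ssh -W %h:%p cslu-gateway` is the usual substitute for the `ProxyJump` line. A keypair can be generated with `ssh-keygen -t rsa -b 4096` and, assuming password logins are enabled on the target host, installed there with `ssh-copy-id`.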
We will have homework, (basically) all of which will involve programming. The point of the homework is to give you "hands on" experience with the algorithms and techniques we'll be covering, not to learn how to write production-ready code. For some of the assignments, I will provide "scaffolding" code that may save you significant time; mostly, this code will be written in either Python or Java. If you want to use something else to do the assignment, you are of course free to do so.
The homework assignments will all come with a "due date." Assignments will be due at 11:59 pm (Portland time) on their due date. If you think you will need additional time to complete an assignment, let me know as soon as possible. If something serious and unexpected comes up at the last minute (illness, family emergency, etc.), we'll work something out.
The deliverables for the final project are an in-class presentation and a short paper done in the style of a conference submission: a maximum of eight pages, not counting references. The writeup will be due on June 15. Unless otherwise agreed, this will be a hard deadline, as grades are due later that week.
Your grade will be based on three things: in-class participation, including paper presentations (30%); homework (30%); and the final project (40%).
This will be a largely "seminar-style" course, and most sessions will involve student-led discussions of journal articles. Everybody should come to class having read the material, and be ready to discuss it as a group.
We will be filling these in as we go along!
Our program is committed to all students achieving their potential. If you have a disability or think you may have a disability (physical, learning, hearing, vision, psychological) which may need a reasonable accommodation, please contact Student Access at (503) 494-0082 or e-mail studentaccess@ohsu.edu to discuss your needs. You can also find more information at www.ohsu.edu/student-access. Because accommodations can take time to implement, it is important to have this discussion as soon as possible. All information regarding a student’s disability is kept in accordance with relevant state and federal laws.