CS/EE 5/624: Problem Solving with Large Clusters

Spring 2016, Mondays & Wednesdays at 4:00


Synopsis

Many real-world computational problems involve data sets that are too large to process on a single computer, or that have other characteristics— fault-tolerance, etc.— that require multiple computers working together. Examples include analysis of high-throughput genomic or proteomic data, data analytics over very large data sets, large-scale social network analysis, training machine learning models on "web scale" data sets, and so forth.

In this course, we will explore a variety of approaches to solving these kinds of problems through a mixture of lectures and student-led discussions of the research literature in the field. We will also hear from several guest lecturers with practical experience applying these kinds of algorithms in both academia and industry.

In addition to reading and discussing articles, students will become familiar with the Hadoop map-reduce environment as well as several other such systems through class assignments. There will also be a final project on the subject of the student's choice.


Fig. 1: Leafcutter ants (Atta colombica) are endemic to South and Central America. Working together in very large numbers, they collect, harvest, and process leaves to use as the substrate for Lepiotaceae fungi, which they actively cultivate. This fungus forms the ants' primary food source, and the ants expend a great deal of effort to protect their fungal colony from pests and molds.

By breaking large leaves into smaller leaves, and by clever specialization and work planning, very small ants are able to farm very large amounts of fungus.

Learning objectives

By the end of the course, students will:

  1. Be exposed to the principles behind several different modern cluster computing paradigms.
  2. Learn strategies for choosing a cluster computing paradigm for particular kinds of data analysis problems.
  3. Gain practice at writing sofware using the Apache Hadoop and Apache Spark environments.
  4. Learn strategies for storing and managing very large data sets.


Prerequisites

A graduate level course on machine learning or probability and statistics. Students should be comfortable coding in at least one programming language, and will find the course much easier if they are familiar with the UNIX command-line environment than if they are not. Students should be comfortable reading scientific articles from the computer science literature.


Instructor

Fig. 2: The instructor, at Conwy Castle in Wales.

CS/EE 5/655 is being taught by Steven Bedrick. He can occasionally be found in his natural habitat, Gaines Hall room 19. He has no set office hours, and GH is far enough off the beaten path that you should probably schedule something with him before making the schlep.

We strongly encourage you to consult the Student Health Center for guidance about any pre-travel immunizations that may be required before visiting Gaines Hall.


Textbook

The course has no textbook, however students may find some of the books listed below to be useful. Most of the assigned readings are available via the OHSU Library; we will arrange for access to the remainder.


Schedule

Date Topic HW Assigned HW Due
Mon Mar 28 Course overview, our infrastructure
Wed Mar 30 Hadoop & Map/Reduce basics; Hadoop APIs HW1
Mon Apr 04 Applications of MR: Inverted Indexing, Machine Translation
Wed Apr 06 Spark, etc. Project Proposals HW1
Mon Apr 11 No class - Steven in Maryland
Wed Apr 13 Condor, Clouds, Workflow
Mon Apr 18 Analytical Tools (Pig, etc.) Project Proposals
Wed Apr 20 Neural Networks (Guest Speaker: Izhak Shafran, Google) HW2
Mon Apr 25 Graphs 1
Wed Apr 27 Bioinformatics Applications (Guest Speaker: Myron Peto, OHSU)
Mon May 02 Graphs 2 Pilot Study Writeup
Wed May 04 Pilot Study Presentations HW2
Mon May 09 Distributed Math & ML HW3
Wed May 11 Language Models (Guest Speaker: Brian Roark, Google)
Mon May 16 Collaborative Filtering HW4 HW3
Wed May 18 Filesystems & Databases
Mon May 23 MPI
Wed May 25 Security HW5 HW4
Mon May 30 Memorial day
Wed Jun 01 Luigi (Joel Adams) and Consensus Algorithms
Mon Jun 06 project presentations 1 HW5
Wed Jun 08 project presentations 2
Mon Jun 13 "Finals week"
Wed Jun 15 "Finals week" Project writeup due

Detailed Schedule

Week 1

Monday
Date Topic Presenter
Mar 28 Introduction & CSLU resources Steven

Assigned readings:

Additional readings of interest:


Wednesday
Date Topic Presenter
Mar 30 Hadoop & Map/Reduce Basics; Hadoop APIs Steven

Assigned readings:

Useful references:


Week 2

Monday
Date Topic Presenter
Apr 04 Applications of MR: Inverted Indexing, Machine Translation Steven

Assigned Readings:

Suggested Reading:


Wednesday
Date Topic Presenter
Apr 06 Spark Steven

Assigned Readings:


Week 3

Monday
Date Topic Presenter
Apr 11 No class - SDB in Bethesda N/A

Wednesday
Date Topic Presenter
Apr 13 HTCondor, Workflows, and Clouds Steven

Assigned Readings:

Recommended Readings


Week 4

Monday
Date Topic Presenter
Apr 18 Analytical Tools (Pig, etc.) Neelay

Assigned Readings:


Wednesday
Date Topic Presenter
Apr 20 Neural Networks Izhak Shafran, Google

Slides:

Assigned Readings:

Strongly Recommended Readings:


Week 5

Monday
Date Topic Presenter
Apr 25 Graphs, part 1 Anders

Assigned Readings:

Strongly Recommended Readings:


Wednesday
Date Topic Presenter
Apr 27 Bioinformatics Applications Myron Peto, OHSU

Assigned Readings:


Week 6

Monday
Date Topic Presenter
May 2 Graphs, part 2 Ogi

Assigned Readings:


Wednesday
Date Topic Presenter
May 4 Pilot Study Presentations Everybody!

Week 7

Monday
Date Topic Presenter
May 9 Distributed Math & ML Soe

Assigned Readings:


Wednesday
Date Topic Presenter
May 11 Distributed Language Modeling Brian Roark, Google

Assigned Readings:

TBD


Week 8

Monday
Date Topic Presenter
May 16 Collaborative Filtering & LDA TBD

Also Happening Today:

  • One more pilot presentation!
  • Discussion of distributed matrix factorization article from 5/9

Assigned Readings:

Suggested:


Wednesday
Date Topic Presenter
May 18 File Systems & Databases TBD

Assigned Readings:

Strongly Recommended:

  • Lamport, Leslie. "Paxos made simple." ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58.
  • The Google File System, Sanjay Ghemawat et. al., ACM Symposium on Operating Systems Principles (SOSP), pg. 29-43, 2003.

Week 9

Monday
Date Topic Presenter
May 23 MPI Rosemary

Assigned Readings:

Useful References:


Wednesday
Date Topic Presenter
May 25 Security Casey

Assigned Readings:

Suggested Readings:


Logistics

Dates and Times

CS5/624 will be held Mondays and Wednesdays, from 4:00 to 5:30 PM, in GH5.

Technical Matters

For this class, we will be using the CSLU "Bigbird" cluster. Ethan Van Matre is the system administrator for this cluster, and will be setting up user accounts for all students; email both him and the Steven if you have any problems with your account. Please include "CS624" in the subject line of your emails, if possible.

A subset of the bigbirds have been set aside for use by this class and loaded with a current version of Spark and Hadoop. The "head node" for this sub-cluster is bigbird61.cslu.ohsu.edu; Hadoop, etc. jobs should be run from there. You can access the administrative console here.

To connect to the cluster from on-campus, simply ssh to {your_user_name}@bigbird61.cslu.ohsu.edu. From off-campus, you'll want to connect to our gateway machine by ssh-ing to {your_user_name}@cslu.ohsu.edu, and from there connecting to bigbird61. I strongly recommend setting up an RSA keypair and using it for SSH authentication; see below for links to some useful SSH-related resources.


Homework

We will have homework, (basically) all of which will involve programming. The point of the homework is to give you "hands on" experience with the algorithms and techniques we'll be covering, not to learn how to write production-ready code. For some of the assignments, I will provide "scaffolding" code that may save you significant time; mostly, this code will be written in either Python or Java. If you want to use something else to do the assignment, you are of course free to do so.

The homework assignments will all come with a "due date." Assignments will be due at 11:59 pm (Portland time) on their due date. If you think you will need additional time to complete an assignment, let me know as soon as possible. If something serious and unexpected comes up at the last minute (illness, family emergency, etc.), we'll work something out.

The deliverables for the final project are an in-class presentation and a short paper done in the style of a conference submission: a maximum of eight pages, not counting references. The writeup will be due on June 15. Unless otherwise agreed, this will be a hard deadline, as grades are due later that week.


Grading

Your grade will be based on three things: in-class participation (including paper presentations) (30%), homework (30%), and the final project (40%).

This will be largely "seminar-style" course, and most sessions will involve student-led discussions of journal articles. Everybody should come to class having read the material, and be ready to discuss it as a group.


Resources

Useful books

Websites of note

We will be filling these in as we go along!

Tools


Student Access Statement

Our program is committed to all students achieving their potential. If you have a disability or think you may have a disability (physical, learning, hearing, vision, psychological) which may need a reasonable accommodation please contact Student Access at (503) 494-0082 or e-mail studentaccess@ohsu.edu to discuss your needs. You can also find more information at www.ohsu.edu/student-access. Because accommodations can take time to implement, it is important to have this discussion as soon as possible. All information regarding a student’s disability is kept in accordance with relevant state and federal laws.