Problem Solving with Large Clusters
Steven Bedrick, CSLU, OHSU


News and Assignments

Course Information

Many real-world computational problems involve data sets that are too large to process on a single computer, or have other requirements (fault tolerance, for example) that call for multiple computers working together. Examples include analysis of high-throughput genomic or proteomic data, data analytics over very large data sets, large-scale social network analysis, training machine learning models on "web scale" data sets, and so forth.

In this course, we will explore a variety of approaches to solving these kinds of problems through a mixture of lectures and student-led discussions of the research literature in the field. We will also hear from several guest lecturers with practical experience applying these kinds of algorithms in both academia and industry.

In addition to reading and discussing articles, students will become familiar, through class assignments, with the Hadoop MapReduce environment and several other such systems. There will also be a final project on a subject of the student's choice.

Prerequisites: A graduate-level course on machine learning or probability and statistics. Students should be comfortable coding in at least one programming language, and will find the course much easier if they are familiar with the UNIX command-line environment.

Grading: Students will be graded as follows: 20% final project; 20% assignments; 60% participation (including paper presentations). This will largely be a "seminar-style" course, and most sessions will revolve around student-led discussions of papers. Everybody should come to class having read the papers and prepared to participate in the discussion.

Final Project: As noted, the final project will make up 20% of the course grade. The project will be on a subject of the student's choosing (in consultation with the instructor), and will require both a written report and a final presentation. If possible, the project's source code and related materials should be publicly releasable. Student projects in previous iterations of this course have resulted in publications, so don't be afraid to think big!

Lectures and Readings

Note that this schedule is currently somewhat "loose", and will shift depending on guest lecturer availability.
Day Date Topic Presenters
Monday 3/31 Course Overview, Infrastructure at CSLU Steve

Wednesday 4/2 Inverted Indexing, MT Language Models, Multicore MapReduce Joel, Shiran
For further reading:
Monday 4/7 More MT, Document Similarity, Topic Modeling Alireza & Mahsa

Wednesday 4/9 Cancelled! N/A

Wednesday 4/16 Distributed approaches to linear algebra Jesse & Joseph
Additional reading (highly recommended!):
  • Ensemble Nystrom Method, S. Kumar et al., Proc. Neural Information Processing Systems (NIPS), 2010. Winner of Best Student Paper at the New York Academy of Sciences 2009 Symposium on ML.
  • Parallel Spectral Clustering, Wen-Yen Chen et al., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010. [code]

Friday 4/18 Machine Learning: perceptron training, parallel SVM, boosted decision trees Shiran, Mahsa, Archana

Monday 4/21 Collaborative Filtering, LDA, Mahout tutorial Joseph, Golnar
Additional Reading:
Wednesday 4/23 Graphs 1 Joel, Mahsa
Additional background reading (strongly recommended):
Monday 4/28 Graphs 2 Shiran, Golnar, Jesse
Additional reading:
Wednesday 4/30 Distributed database algorithms (and K-means++) Archana & Joel
  • Bahmani, Bahman, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. "Scalable k-means++." Proceedings of the VLDB Endowment 5, no. 7 (2012): 622-633. (Presenter: Archana)
  • Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31, no. 3 (2013): 8. (Joel)
Additional Reading:
  • Lamport, Leslie. "Paxos made simple." ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58. (highly recommended!)
  • Thusoo, Ashish, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. "Hive - a petabyte scale data warehouse using Hadoop." In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pp. 996-1005. IEEE, 2010.

Monday 5/5 Distributed data stores (BigTable, Hive, etc.) Joseph, Golnar

Friday 5/9 Guest Speaker: Izhak Shafran, PhD (Google) TBA
Additional Reading:
Monday 5/12 Distributed data analysis tools Joel, Golnar, Jesse

Wednesday 5/14 Distributed File Systems Keith Mannthey, Intel Corp.

Wednesday 5/21 Genomics Guest Speaker: Myron Peto, PhD (Spellman Lab, OHSU)

Friday 5/23 Guest Speaker: Kyle Ambert, PhD (Intel Corp.) [slides]

Monday 5/26 No class: Memorial Day! TBA

Wednesday 5/28 Condor workflows, MPI Shiran
Additional Condor reading:
Monday 6/2 Consensus algorithms & other alternative approaches (Steve out of town) TBA
  • XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Nonparametric Decentralized Detection using Kernel Methods, IEEE Transactions on Signal Processing, pp 4053-4066, 2005; Outstanding ICML student paper award. (Jesse)
  • Lamport, Leslie. "Paxos made simple." ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58. (yes, it was "recommended" before, but we're going to do it for real this time) (Mahsa)

Friday 6/6 Streaming/"Realtime" approaches TBA
Additional reading:
Monday 6/9 Xeon Phi, etc. Michael Julier (Intel Corp.)
Wednesday 6/11 "Cloud" clusters (EC2, StarCluster, PiCloud): setup & administration TBA
Monday 6/16 Project presentations! TBA
Wednesday 6/18 Project presentations! TBA


Assignment Assigned Due
Homework 1: Word counting, etc. 3/31/14 4/6/14

Textbook and Other Useful Resources

Note: For recently developed techniques, we will rely on selected papers, which will be provided as required readings.


Meetings: Generally Monday and Wednesday, 16:00-17:30 (some weeks we may meet on Friday instead of Monday or Wednesday)
Venue: SoN 116
Office hours: By appointment (request by email)


Relevant Software Tools & Other Resources
Note: This list, while useful, is out of date. We will update it throughout the course!