Synopsis
Many real-world computational problems involve data sets that are too large to process on a single computer, or that have other requirements (fault tolerance, for example) that call for multiple computers working together. Examples include analysis of high-throughput genomic or proteomic data, data analytics over very large data sets, large-scale social network analysis, training machine learning models on "web-scale" data sets, and so forth.
In this course, we will explore a variety of approaches to solving these kinds of problems through a mixture of lectures and student-led discussions of the research literature in the field. We will also hear from several guest lecturers with practical experience applying these kinds of algorithms in both academia and industry.
In addition to reading and discussing articles, students will become familiar with the Hadoop map-reduce environment as well as several other such systems through class assignments. There will also be a final project on the subject of the student's choice.
Fig. 1: Leafcutter ants (Atta colombica) are endemic to South and Central America. Working together in very large numbers, they collect, harvest, and process leaves to use as the substrate for Lepiotaceae fungi, which they actively cultivate. This fungus forms the ants' primary food source, and the ants expend a great deal of effort to protect their fungal colony from pests and molds.
By breaking large leaves into smaller pieces, and through clever specialization and division of labor, very small ants are able to farm very large amounts of fungus.
Learning objectives
By the end of the course, students will:
- Be exposed to the principles behind several different modern cluster computing paradigms.
- Learn strategies for choosing a cluster computing paradigm for particular kinds of data analysis problems.
- Gain practice at writing software using the Apache Hadoop and Apache Spark environments.
- Learn strategies for storing and managing very large data sets.
Prerequisites
A graduate-level course on machine learning or on probability and statistics. Students should be comfortable coding in at least one programming language; familiarity with the UNIX command-line environment will make the course considerably easier. Students should also be comfortable reading scientific articles from the computer science literature.
Instructor
CS/EE 5/655 is being taught by Steven Bedrick. He can occasionally be found in his natural habitat, Gaines Hall room 19. He has no set office hours, and GH is far enough off the beaten path that you should probably schedule something with him before making the schlep.
We strongly encourage you to consult the Student Health Center for guidance about any pre-travel immunizations that may be required before visiting Gaines Hall.
Textbook
The course has no textbook; however, students may find some of the books listed below useful. Most of the assigned readings are available via the OHSU Library; we will arrange for access to the remainder.
Schedule
Date | Topic | HW Assigned | HW Due
--- | --- | --- | ---
Mon Mar 28 | Course overview, our infrastructure | |
Wed Mar 30 | Hadoop & Map/Reduce basics; Hadoop APIs | HW1 |
Mon Apr 04 | Applications of MR: Inverted Indexing, Machine Translation | |
Wed Apr 06 | Spark, etc. | Project Proposals | HW1
Mon Apr 11 | No class - Steven in Maryland | |
Wed Apr 13 | Condor, Clouds, Workflow | |
Mon Apr 18 | Analytical Tools (Pig, etc.) | | Project Proposals
Wed Apr 20 | Neural Networks (Guest Speaker: Izhak Shafran, Google) | HW2 |
Mon Apr 25 | Graphs 1 | |
Wed Apr 27 | Bioinformatics Applications (Guest Speaker: Myron Peto, OHSU) | |
Mon May 02 | Graphs 2 | | Pilot Study Writeup
Wed May 04 | Pilot Study Presentations | | HW2
Mon May 09 | Distributed Math & ML | HW3 |
Wed May 11 | Language Models (Guest Speaker: Brian Roark, Google) | |
Mon May 16 | Collaborative Filtering | HW4 | HW3
Wed May 18 | Filesystems & Databases | |
Mon May 23 | MPI | |
Wed May 25 | Security | HW5 | HW4
Mon May 30 | Memorial Day | |
Wed Jun 01 | Luigi (Joel Adams) and Consensus Algorithms | |
Mon Jun 06 | Project Presentations 1 | | HW5
Wed Jun 08 | Project Presentations 2 | |
Mon Jun 13 | "Finals week" | |
Wed Jun 15 | "Finals week" | | Project writeup due
Detailed Schedule
Week 1
Monday
Assigned readings:
Additional readings of interest:
Wednesday
Assigned readings:
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
- MapReduce: A Flexible Data Processing Tool, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, vol. 53, no. 1, pp. 72-77, 2010.
- Lin & Dyer, Chapters 1-3
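The canonical example from the Dean & Ghemawat papers is word count. Here is a single-machine Python sketch of the two phases (the function names and sample data are ours, not from the papers; the `sorted()` call stands in for the framework's shuffle-and-sort step):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # The framework sorts and groups by key between map and reduce;
    # each reducer then sums the counts for a single word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(dict(reduce_phase(map_phase(docs))))
```

In the real system, the mapper and reducer run on different machines and the shuffle moves each key's pairs to one reducer; the logic per phase is exactly this simple.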
Useful references:
Week 2
Monday
Assigned Readings:
- Challenges in building large-scale information retrieval systems, Jeffrey Dean, Invited talk, Proc. of the Second ACM International Conference on Web Search and Data Mining, 2009.
- Lin & Dyer, Chapter 4: Inverted Indexing for Text Retrieval
- Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce, Christopher Dyer et al., Proc. ACL Workshop on Statistical Machine Translation, pp. 199-207, 2008. slides
- Large Language Models in Machine Translation, Thorsten Brants et al., Proc. Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP), pp. 858-867, 2007.
Suggested Reading:
- "Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search." Jimmy Lin, Donald Metzler, Tamer Elsayed, and Lidan Wang. Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009), November 2009, Gaithersburg, Maryland.
- For interesting (and, at this point, classic) discussion of what you can do with web-scale language models, consult:
- "Using the web for language independent spellchecking and autocorrection." Whitelaw, Casey, et al. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2. Association for Computational Linguistics, 2009.
- "Smoothing techniques for adaptive online language models: topic tracking in tweet streams." Lin, Jimmy, Rion Snow, and William Morgan. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
- "Exploring web scale language models for search query processing." Huang, Jian, et al. Proceedings of the 19th international conference on World wide web. ACM, 2010.
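Lin & Dyer's Chapter 4 applies the MapReduce pattern to indexing: mappers emit (term, doc_id) postings, and the shuffle groups each term's postings for a reducer to collect. A toy single-machine sketch (the names and data are ours):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of doc_id -> text. Returns term -> sorted doc_id list.
    In the MapReduce version, the emit/collect steps below become the map
    and reduce phases, with the shuffle doing the grouping."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)          # "emit (term, doc_id)"
    return {term: sorted(ids) for term, ids in index.items()}

if __name__ == "__main__":
    docs = {1: "new home sales", 2: "home prices rise", 3: "new prices"}
    idx = build_inverted_index(docs)
    print(idx["new"])    # [1, 3]
```

A query for a term is then just a lookup, and multi-term queries intersect or merge the postings lists.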
Wednesday
Date | Topic | Presenter
--- | --- | ---
Apr 06 | Spark | Steven
Assigned Readings:
- Hortonworks Introduction to Spark
- "Spark: Cluster Computing with Working Sets", Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.
- "Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters." Zaharia, Matei, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing, pp. 10-10. USENIX Association, 2012.
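The central idea in the Zaharia papers is keeping a working set in cluster memory so that iterative jobs can reuse it, rather than re-reading it from disk on every pass the way chained MapReduce jobs must. Here is a pure-Python sketch of that access pattern, not actual Spark API (the dataset and names are ours), using an iterative least-squares fit as a stand-in workload:

```python
# Iterative job over a cached working set, in the spirit of Spark's
# rdd.cache(): pay the load cost once, then run many passes in memory.

def load_dataset():
    # Stand-in for an expensive read from distributed storage.
    # Points lie on the line y = 2x + 1.
    return [(x, 2.0 * x + 1.0) for x in range(100)]

def gradient_step(data, w, b, lr=0.0001):
    # One full pass of least-squares gradient descent over the working set.
    n = len(data)
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2 * (w * x + b - y) for x, y in data) / n
    return w - lr * gw, b - lr * gb

cached = load_dataset()            # "cache()": one read, many uses
w, b = 0.0, 0.0
for _ in range(1000):              # each iteration reuses the in-memory data
    w, b = gradient_step(cached, w, b)
```

In a MapReduce chain, every one of those thousand passes would be a separate job re-reading the input from HDFS; Spark's RDDs make the `cached` line a first-class, fault-tolerant operation.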
Week 3
Monday
Date | Topic | Presenter
--- | --- | ---
Apr 11 | No class - SDB in Bethesda | N/A
Wednesday
Assigned Readings:
- Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005.
- Peter Couvares, Tevfik Kosar, Alain Roy, Jeff Weber and Kent Wenger, "Workflow in Condor", in Workflows for e-Science, Editors: I. Taylor, E. Deelman, D. Gannon, M. Shields, Springer Press, January 2007.
- Dan Bradley, Timothy St Clair, Matthew Farrellee, Ziliang Guo, Miron Livny, Igor Sfiligoi, and Todd Tannenbaum, "An update on the scalability limits of the Condor batch system", Journal of Physics: Conference Series, Vol. 331, No. 6, 2011.
Recommended Readings
Week 4
Monday
Date | Topic | Presenter
--- | --- | ---
Apr 18 | Analytical Tools (Pig, etc.) | Neelay
Assigned Readings:
- Gates, Alan F., Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. "Building a high-level dataflow system on top of Map-Reduce: the Pig experience." Proceedings of the VLDB Endowment 2, no. 2 (2009): 1414-1425.
- Melnik, Sergey, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. "Dremel: interactive analysis of web-scale datasets." Proceedings of the VLDB Endowment 3, no. 1-2 (2010): 330-339.
- Pike, Rob, Sean Dorward, Robert Griesemer, and Sean Quinlan. "Interpreting the data: Parallel analysis with Sawzall." Scientific Programming 13, no. 4 (2005): 277-298.
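Pig's pitch is that a handful of dataflow operators (LOAD, FILTER, GROUP, FOREACH ... GENERATE) compile down to MapReduce jobs. The same group-and-aggregate shape written directly in Python, with comments mapping each step to its Pig Latin analogue (the data and names are ours):

```python
from collections import defaultdict

# Roughly:  visits  = LOAD 'visits' AS (user, url);
#           grouped = GROUP visits BY user;
#           counts  = FOREACH grouped GENERATE group, COUNT(visits);
visits = [("alice", "a.com"), ("bob", "b.com"), ("alice", "c.com")]

grouped = defaultdict(list)
for user, url in visits:            # GROUP BY user (the shuffle, in MR terms)
    grouped[user].append(url)

counts = {user: len(urls) for user, urls in grouped.items()}   # FOREACH/COUNT
print(counts)    # {'alice': 2, 'bob': 1}
```

The point of Pig (and Dremel, and Sawzall) is that you write only the declarative steps; the system plans the parallel grouping and aggregation for you.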
Wednesday
Date | Topic | Presenter
--- | --- | ---
Apr 20 | Neural Networks | Izhak Shafran, Google
Slides:
Assigned Readings:
- Abadi, Martín, et al. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems." arXiv preprint arXiv:1603.04467, Mar 2016.
- Dean, Jeffrey, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao et al. "Large Scale Distributed Deep Networks." In NIPS 2012, pp. 1232-1240.
- Georg Heigold, Erik McDermott, Vincent Vanhoucke, Andrew Senior, Michiel Bacchiani. "Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks" in Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
Strongly Recommended Readings:
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space". In Proceedings of Workshops at ICLR, 2013.
- Ian J. Goodfellow; Yaroslav Bulatov; Julian Ibarz; Sacha Arnoud; Vinay Shet. "Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks" In Proceedings of Workshops at ICLR, 2014.
- Coates, Adam, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng Andrew. "Deep learning with COTS HPC systems." In Proceedings of The 30th International Conference on Machine Learning, pp. 1337-1345. 2013.
- TensorFlow Tutorials
- TensorFlow Playground
Week 5
Monday
Date | Topic | Presenter
--- | --- | ---
Apr 25 | Graphs, part 1 | Anders
Assigned Readings:
- "Graph Twiddling in a MapReduce World", Jonathan Cohen, Computing in Science and Engineering, vol. 11, no. 4, pp. 29-41, 2009.
- Stanton, Isabelle, and Gabriel Kliot. "Streaming graph partitioning for large distributed graphs." In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222-1230. ACM, 2012.
- Kollias, Giorgos, Madan Sathe, Olaf Schenk, and Ananth Grama. "Fast parallel algorithms for graph similarity and matching." Journal of Parallel and Distributed Computing (2014).
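One of the heuristics evaluated in the Stanton & Kliot paper is "linear deterministic greedy": stream over the vertices once, placing each vertex in the partition holding the most of its already-placed neighbors, discounted by how full that partition is. A rough single-machine sketch (the names are ours, and ties here break toward the emptiest partition):

```python
def ldg_partition(stream, k, capacity):
    """stream: iterable of (vertex, neighbor_list) pairs seen once each.
    Returns vertex -> partition index, using k partitions of the given
    capacity."""
    parts = [set() for _ in range(k)]
    assignment = {}
    for v, neighbors in stream:
        def gain(i):
            # Neighbors of v already placed in partition i, penalized
            # by partition fullness (the "linear" part of LDG).
            placed = sum(1 for n in neighbors if assignment.get(n) == i)
            return placed * (1.0 - len(parts[i]) / capacity)
        # Highest gain wins; ties break toward the least-loaded partition.
        best = max(range(k), key=lambda i: (gain(i), -len(parts[i])))
        parts[best].add(v)
        assignment[v] = best
    return assignment
```

On a graph of two disjoint triangles, the heuristic keeps each triangle together, which is exactly the edge-cut reduction the paper is after.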
Strongly Recommended Readings:
Wednesday
Date | Topic | Presenter
--- | --- | ---
Apr 27 | Bioinformatics Applications | Myron Peto, OHSU
Assigned Readings:
Week 6
Monday
Date | Topic | Presenter
--- | --- | ---
May 2 | Graphs, part 2 | Ogi
Assigned Readings:
- Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. "The PageRank citation ranking: Bringing order to the web." (1999).
- Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 973-984. ACM, 2011.
- "Pregel: A System for Large-Scale Graph Processing", Grzegorz Malewicz et al., Proc. International Conference on Management of Data (SIGMOD), pp. 135-146, 2010.
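The computation that both the personalized-PageRank and Pregel papers distribute is, at its heart, power iteration. A minimal single-machine version (the graph and names are ours; for simplicity it assumes every node has at least one out-link):

```python
def pagerank(links, d=0.85, iters=50):
    """links: node -> list of out-neighbors (every node appears as a key).
    Returns node -> rank. Each iteration is what one MapReduce job (or one
    Pregel superstep) computes: every node sends rank/out-degree to its
    neighbors, and each node sums what it receives."""
    n = len(links)
    rank = {v: 1.0 / n for v in links}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in links}   # teleport term
        for v, outs in links.items():
            share = rank[v] / len(outs)
            for u in outs:
                new[u] += d * share
        rank = new
    return rank
```

The distributed versions differ mainly in where the "send shares, sum shares" step runs: in MapReduce it is a full map/shuffle/reduce per iteration, while Pregel keeps vertex state resident between supersteps.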
Wednesday
Date | Topic | Presenter
--- | --- | ---
May 4 | Pilot Study Presentations | Everybody!
Week 7
Monday
Date | Topic | Presenter
--- | --- | ---
May 9 | Distributed Math & ML | Soe
Assigned Readings:
- "Distributed Training Strategies for the Structured Perceptron", Ryan McDonald et al., Proc. NAACL, 2010.
- "Scalable k-means++." Bahmani, Bahman, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Proceedings of the VLDB Endowment 5, no. 7 (2012): 622-633.
- "Stochastic Gradient Boosted Distributed Decision Trees", Jerry Ye et al., Proc. ACM Conference on Information and Knowledge Management (CIKM), pp. 2061-2064, 2009.
- "Distributed matrix factorization with mapreduce using a series of broadcast-joins." Schelter, Sebastian, Christoph Boden, Martin Schenck, Alexander Alexandrov, and Volker Markl. In Proceedings of the 7th ACM conference on Recommender systems, pp. 281-284. ACM, 2013.
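The simplest strategy in the McDonald et al. paper is "parameter mixing": train a perceptron independently on each shard of the data, then average the resulting weight vectors. A toy two-feature binary version (the data and names are ours; the paper trains structured perceptrons, and also develops a better iterative variant):

```python
def train_perceptron(shard, epochs=10):
    # Standard perceptron on one data shard; labels y are in {-1, +1}.
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in shard:
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:   # misclassified
                w = [w[0] + y * x[0], w[1] + y * x[1]]
                b += y
    return w, b

def parameter_mix(shards):
    # "Map": train one model per shard. "Reduce": average the weights.
    models = [train_perceptron(s) for s in shards]
    w = [sum(m[0][i] for m in models) / len(models) for i in range(2)]
    b = sum(m[1] for m in models) / len(models)
    return w, b
```

This is embarrassingly parallel (one MapReduce job, no communication during training), which is exactly why the paper uses it as the baseline against which smarter iterative mixing is measured.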
Wednesday
Date | Topic | Presenter
--- | --- | ---
May 11 | Distributed Language Modeling | Brian Roark, Google
Assigned Readings:
TBD
Week 8
Monday
Date | Topic | Presenter
--- | --- | ---
May 16 | Collaborative Filtering & LDA | TBD
Also Happening Today:
- One more pilot presentation!
- Discussion of distributed matrix factorization article from 5/9
Assigned Readings:
Suggested:
Wednesday
Date | Topic | Presenter
--- | --- | ---
May 18 | File Systems & Databases | TBD
Assigned Readings:
- Bigtable: A Distributed Storage System for Structured Data, Fay Chang et al., Proc. Symposium on Operating Systems Design and Implementation (OSDI), pp. 205-218, 2006.
- Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31, no. 3 (2013): 8. (Joel)
Strongly Recommended:
- Lamport, Leslie. "Paxos made simple." ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58.
- The Google File System, Sanjay Ghemawat et al., ACM Symposium on Operating Systems Principles (SOSP), pp. 29-43, 2003.
Week 9
Monday
Date | Topic | Presenter
--- | --- | ---
May 23 | MPI | Rosemary
Assigned Readings:
Useful References:
Wednesday
Date | Topic | Presenter
--- | --- | ---
May 25 | Security | Casey
Assigned Readings:
Suggested Readings:
Week 10
Wednesday
Date | Topic | Presenter
--- | --- | ---
June 1 | Workflow with Luigi | Guest: Joel
Suggested Readings:
Logistics
Dates and Times
CS5/624 will be held Mondays and Wednesdays, from 4:00 to 5:30 PM, in GH5.
Technical Matters
For this class, we will be using the CSLU "Bigbird" cluster. Ethan Van Matre is the system administrator for this cluster, and will be setting up user accounts for all students; email both him and Steven if you have any problems with your account. Please include "CS624" in the subject line of your emails, if possible.
A subset of the bigbird machines has been set aside for use by this class and loaded with current versions of Spark and Hadoop. The "head node" for this sub-cluster is bigbird61.cslu.ohsu.edu; Hadoop (and other) jobs should be run from there. You can access the administrative console here.
To connect to the cluster from on-campus, simply ssh to {your_user_name}@bigbird61.cslu.ohsu.edu. From off-campus, you'll want to connect to our gateway machine by ssh-ing to {your_user_name}@cslu.ohsu.edu, and from there connecting to bigbird61. I strongly recommend setting up an RSA keypair and using it for SSH authentication; see below for links to some useful SSH-related resources.
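For off-campus use, an ~/.ssh/config stanza can automate the two-hop connection. A hypothetical example (substitute your own username; ProxyJump requires OpenSSH 7.3 or newer, so older clients will need a ProxyCommand instead):

```
# ~/.ssh/config -- example only; replace "yourname" with your username
Host cslu-gateway
    HostName cslu.ohsu.edu
    User yourname

Host bigbird61
    HostName bigbird61.cslu.ohsu.edu
    User yourname
    ProxyJump cslu-gateway     # hop through the gateway automatically
```

With this in place, `ssh bigbird61` works from off campus in one command, and your RSA keypair can authenticate both hops.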
Homework
We will have homework, (basically) all of which will involve programming. The point of the homework is to give you "hands on" experience with the algorithms and techniques we'll be covering, not to learn how to write production-ready code. For some of the assignments, I will provide "scaffolding" code that may save you significant time; mostly, this code will be written in either Python or Java. If you want to use something else to do the assignment, you are of course free to do so.
The homework assignments will all come with a "due date." Assignments will be due at 11:59 pm (Portland time) on their due date. If you think you will need additional time to complete an assignment, let me know as soon as possible. If something serious and unexpected comes up at the last minute (illness, family emergency, etc.), we'll work something out.
The deliverables for the final project are an in-class presentation and a short paper done in the style of a conference submission: a maximum of eight pages, not counting references. The writeup will be due on June 15. Unless otherwise agreed, this will be a hard deadline, as grades are due later that week.
Grading
Your grade will be based on three things: in-class participation (including paper presentations) (30%), homework (30%), and the final project (40%).
This will be a largely "seminar-style" course, and most sessions will involve student-led discussions of journal articles. Everybody should come to class having read the material, and be ready to discuss it as a group.
Resources
Useful books
Websites of note
We will be filling these in as we go along!
Tools
- Sidestep (useful for setting up an SSH proxy for off-campus access to the cluster)
Student Access Statement
Our program is committed to all students achieving their potential. If you have a disability, or think you may have a disability (physical, learning, hearing, vision, psychological), and may need a reasonable accommodation, please contact Student Access at (503) 494-0082 or e-mail studentaccess@ohsu.edu to discuss your needs. You can also find more information at www.ohsu.edu/student-access. Because accommodations can take time to implement, it is important to have this discussion as soon as possible. All information regarding a student's disability is kept in accordance with relevant state and federal laws.