Synopsis
Many real-world computational problems involve data sets that are too large to process on a single computer, or that have other requirements (fault tolerance, for example) that call for multiple computers working together. Examples include analysis of high-throughput genomic or proteomic data, data analytics over very large data sets, large-scale social network analysis, training machine learning models on "web-scale" data sets, and so forth.
In this course, we will explore a variety of approaches to solving these kinds of problems through a mixture of lectures and student-led discussions of the research literature in the field. We will also hear from several guest lecturers with practical experience applying these kinds of algorithms in both academia and industry.
In addition to reading and discussing articles, students will become familiar with the Hadoop map-reduce environment as well as several other such systems through class assignments. There will also be a final project on the subject of the student's choice.
Fig. 1: Leafcutter ants (Atta colombica) are endemic to South and Central America. Working together in very large numbers, they collect, harvest, and process leaves to use as the substrate for Lepiotaceae fungi, which they actively cultivate. This fungus forms the ants' primary food source, and the ants expend a great deal of effort to protect their fungal colony from pests and molds.
By breaking large leaves into smaller pieces, and through clever specialization and division of labor, very small ants are able to farm very large amounts of fungus.
Learning objectives
By the end of the course, students will:
- Be exposed to the principles behind several different modern cluster computing paradigms.
- Learn strategies for choosing a cluster computing paradigm for particular kinds of data analysis problems.
- Gain practice at writing software using the Apache Hadoop and Apache Spark environments.
- Learn strategies for storing and managing very large data sets.
Prerequisites
A graduate-level course on machine learning or on probability and statistics. Students should be comfortable coding in at least one programming language; familiarity with the UNIX command-line environment will make the course considerably easier. Students should also be comfortable reading scientific articles from the computer science literature.
Instructor
CS/EE 5/655 is being taught by Steven Bedrick. He can occasionally be found in his natural habitat, Gaines Hall room 19. He has no set office hours, and GH is far enough off the beaten path that you should probably schedule something with him before making the schlep.
We strongly encourage you to consult the Student Health Center for guidance about any pre-travel immunizations that may be required before visiting Gaines Hall.
Textbook
The course has no textbook; however, students may find some of the books listed below useful. Most of the assigned readings are available via the OHSU Library; we will arrange for access to the remainder.
Schedule
Date | Topic | HW Assigned | HW Due
--- | --- | --- | ---
Mon Mar 28 | Course overview, our infrastructure | |
Wed Mar 30 | Hadoop & Map/Reduce basics; Hadoop APIs | HW1 |
Mon Apr 04 | Applications of MR: Inverted Indexing, Machine Translation | |
Wed Apr 06 | Spark, etc. | Project Proposals | HW1
Mon Apr 11 | No class - Steven in Maryland | |
Wed Apr 13 | Condor, Clouds, Workflow | |
Mon Apr 18 | Analytical Tools (Pig, etc.) | | Project Proposals
Wed Apr 20 | Neural Networks (Guest Speaker: Izhak Shafran, Google) | HW2 |
Mon Apr 25 | Graphs 1 | |
Wed Apr 27 | Bioinformatics Applications (Guest Speaker: Myron Peto, OHSU) | |
Mon May 02 | Graphs 2 | | Pilot Study Writeup
Wed May 04 | Pilot Study Presentations | | HW2
Mon May 09 | Distributed Math & ML | HW3 |
Wed May 11 | Language Models (Guest Speaker: Brian Roark, Google) | |
Mon May 16 | Collaborative Filtering | HW4 | HW3
Wed May 18 | Filesystems & Databases | |
Mon May 23 | MPI | |
Wed May 25 | Security | HW5 | HW4
Mon May 30 | Memorial Day | |
Wed Jun 01 | Luigi (Joel Adams) and Consensus Algorithms | |
Mon Jun 06 | Project Presentations 1 | | HW5
Wed Jun 08 | Project Presentations 2 | |
Mon Jun 13 | "Finals week" | |
Wed Jun 15 | "Finals week" | | Project writeup due
Detailed Schedule
Week 1
Monday
Assigned readings:
Additional readings of interest:
Wednesday
Assigned readings:
- MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
- MapReduce: A Flexible Data Processing Tool, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, vol. 53, no. 1, pp. 72-77, 2010.
- Lin & Dyer, Chapters 1-3
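The canonical example from the Dean & Ghemawat papers is word count. Here is a single-machine Python sketch of the two phases (the function names and sample data are ours, not from the papers; the `sorted()` call stands in for the framework's shuffle-and-sort step):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # The framework sorts and groups by key between map and reduce;
    # each reducer then sums the counts for a single word.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(dict(reduce_phase(map_phase(docs))))
```

In the real system, the mapper and reducer run on different machines and the shuffle moves each key's pairs to one reducer; the logic per phase is exactly this simple.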
Useful references:
Week 2
Monday
Assigned Readings:
- Challenges in building large-scale information retrieval systems, Jeffrey Dean, Invited talk, Proc. of the Second ACM International Conference on Web Search and Data Mining, 2009.
- Lin & Dyer, Chapter 4: Inverted Indexing for Text Retrieval
- Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce, Christopher Dyer et al., Proc. ACL Workshop on Statistical Machine Translation, pp. 199-207, 2008. slides
- Large Language Models in Machine Translation, Thorsten Brants et al., Proc. Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP), pp. 858-867, 2007.
Suggested Reading:
- "Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search." Jimmy Lin, Donald Metzler, Tamer Elsayed, and Lidan Wang. Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009), November 2009, Gaithersburg, Maryland.
- For interesting (and, at this point, classic) discussion of what you can do with web-scale language models, consult:
- "Using the web for language independent spellchecking and autocorrection." Whitelaw, Casey, et al. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2. Association for Computational Linguistics, 2009.
- "Smoothing techniques for adaptive online language models: topic tracking in tweet streams." Lin, Jimmy, Rion Snow, and William Morgan. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
- "Exploring web scale language models for search query processing." Huang, Jian, et al. Proceedings of the 19th international conference on World wide web. ACM, 2010.
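Lin & Dyer's Chapter 4 applies the MapReduce pattern to indexing: mappers emit (term, doc_id) postings, and the shuffle groups each term's postings for a reducer to collect. A toy single-machine sketch (the names and data are ours):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of doc_id -> text. Returns term -> sorted doc_id list.
    In the MapReduce version, the emit/collect steps below become the map
    and reduce phases, with the shuffle doing the grouping."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)          # "emit (term, doc_id)"
    return {term: sorted(ids) for term, ids in index.items()}

if __name__ == "__main__":
    docs = {1: "new home sales", 2: "home prices rise", 3: "new prices"}
    idx = build_inverted_index(docs)
    print(idx["new"])    # [1, 3]
```

A query for a term is then just a lookup, and multi-term queries intersect or merge the postings lists.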
Wednesday
Date | Topic | Presenter
--- | --- | ---
Apr 06 | Spark | Steven
Assigned Readings:
- Hortonworks Introduction to Spark
- "Spark: Cluster Computing with Working Sets", Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.
- "Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters." Zaharia, Matei, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing, pp. 10-10. USENIX Association, 2012.
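The central idea in the Zaharia papers is keeping a working set in cluster memory so that iterative jobs can reuse it, rather than re-reading it from disk on every pass the way chained MapReduce jobs must. Here is a pure-Python sketch of that access pattern, not actual Spark API (the dataset and names are ours), using an iterative least-squares fit as a stand-in workload:

```python
# Iterative job over a cached working set, in the spirit of Spark's
# rdd.cache(): pay the load cost once, then run many passes in memory.

def load_dataset():
    # Stand-in for an expensive read from distributed storage.
    # Points lie on the line y = 2x + 1.
    return [(x, 2.0 * x + 1.0) for x in range(100)]

def gradient_step(data, w, b, lr=0.0001):
    # One full pass of least-squares gradient descent over the working set.
    n = len(data)
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2 * (w * x + b - y) for x, y in data) / n
    return w - lr * gw, b - lr * gb

cached = load_dataset()            # "cache()": one read, many uses
w, b = 0.0, 0.0
for _ in range(1000):              # each iteration reuses the in-memory data
    w, b = gradient_step(cached, w, b)
```

In a MapReduce chain, every one of those thousand passes would be a separate job re-reading the input from HDFS; Spark's RDDs make the `cached` line a first-class, fault-tolerant operation.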
Week 3
Monday
Date | Topic | Presenter
--- | --- | ---
Apr 11 | No class - SDB in Bethesda | N/A
Wednesday
Assigned Readings:
- Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005.
- Peter Couvares, Tevfik Kosar, Alain Roy, Jeff Weber and Kent Wenger, "Workflow in Condor", in Workflows for e-Science, Editors: I. Taylor, E. Deelman, D. Gannon, M. Shields, Springer Press, January 2007.
- Dan Bradley, Timothy St Clair, Matthew Farrellee, Ziliang Guo, Miron Livny, Igor Sfiligoi, and Todd Tannenbaum, "An update on the scalability limits of the Condor batch system", Journal of Physics: Conference Series, Vol. 331, No. 6, 2011.
Recommended Readings
Week 4
Monday
Date | Topic | Presenter
--- | --- | ---
Apr 18 | Analytical Tools (Pig, etc.) | Neelay
Assigned Readings:
- Gates, Alan F., Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. "Building a high-level dataflow system on top of Map-Reduce: the Pig experience." Proceedings of the VLDB Endowment 2, no. 2 (2009): 1414-1425.
- Melnik, Sergey, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. "Dremel: interactive analysis of web-scale datasets." Proceedings of the VLDB Endowment 3, no. 1-2 (2010): 330-339.
- Pike, Rob, Sean Dorward, Robert Griesemer, and Sean Quinlan. "Interpreting the data: Parallel analysis with Sawzall." Scientific Programming 13, no. 4 (2005): 277-298.
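Pig's pitch is that a handful of dataflow operators (LOAD, FILTER, GROUP, FOREACH ... GENERATE) compile down to MapReduce jobs. The same group-and-aggregate shape written directly in Python, with comments mapping each step to its Pig Latin analogue (the data and names are ours):

```python
from collections import defaultdict

# Roughly:  visits  = LOAD 'visits' AS (user, url);
#           grouped = GROUP visits BY user;
#           counts  = FOREACH grouped GENERATE group, COUNT(visits);
visits = [("alice", "a.com"), ("bob", "b.com"), ("alice", "c.com")]

grouped = defaultdict(list)
for user, url in visits:            # GROUP BY user (the shuffle, in MR terms)
    grouped[user].append(url)

counts = {user: len(urls) for user, urls in grouped.items()}   # FOREACH/COUNT
print(counts)    # {'alice': 2, 'bob': 1}
```

The point of Pig (and Dremel, and Sawzall) is that you write only the declarative steps; the system plans the parallel grouping and aggregation for you.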
Wednesday
Date | Topic | Presenter
--- | --- | ---
Apr 20 | Neural Networks | Izhak Shafran, Google
Slides:
Assigned Readings:
- Abadi, Martín, et al. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems." arXiv preprint arXiv:1603.04467, Mar 2016.
- Dean, Jeffrey, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao et al. "Large Scale Distributed Deep Networks." In NIPS 2012, pp. 1232-1240.
- Georg Heigold, Erik McDermott, Vincent Vanhoucke, Andrew Senior, Michiel Bacchiani. "Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks" in Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
Strongly Recommended Readings:
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space". In Proceedings of Workshops at ICLR, 2013.
- Ian J. Goodfellow; Yaroslav Bulatov; Julian Ibarz; Sacha Arnoud; Vinay Shet. "Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks" In Proceedings of Workshops at ICLR, 2014.
- Coates, Adam, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng Andrew. "Deep learning with COTS HPC systems." In Proceedings of The 30th International Conference on Machine Learning, pp. 1337-1345. 2013.
- TensorFlow Tutorials
- TensorFlow Playground
Week 5
Monday
Date | Topic | Presenter
--- | --- | ---
Apr 25 | Graphs, part 1 | Anders
Assigned Readings:
- "Graph Twiddling in a MapReduce World", Jonathan Cohen, Computing in Science and Engineering, vol. 11, no. 4, pp. 29-41, 2009.
- Stanton, Isabelle, and Gabriel Kliot. "Streaming graph partitioning for large distributed graphs." In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222-1230. ACM, 2012.
- Kollias, Giorgos, Madan Sathe, Olaf Schenk, and Ananth Grama. "Fast parallel algorithms for graph similarity and matching." Journal of Parallel and Distributed Computing (2014).
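One of the heuristics evaluated in the Stanton & Kliot paper is "linear deterministic greedy": stream over the vertices once, placing each vertex in the partition holding the most of its already-placed neighbors, discounted by how full that partition is. A rough single-machine sketch (the names are ours, and ties here break toward the emptiest partition):

```python
def ldg_partition(stream, k, capacity):
    """stream: iterable of (vertex, neighbor_list) pairs seen once each.
    Returns vertex -> partition index, using k partitions of the given
    capacity."""
    parts = [set() for _ in range(k)]
    assignment = {}
    for v, neighbors in stream:
        def gain(i):
            # Neighbors of v already placed in partition i, penalized
            # by partition fullness (the "linear" part of LDG).
            placed = sum(1 for n in neighbors if assignment.get(n) == i)
            return placed * (1.0 - len(parts[i]) / capacity)
        # Highest gain wins; ties break toward the least-loaded partition.
        best = max(range(k), key=lambda i: (gain(i), -len(parts[i])))
        parts[best].add(v)
        assignment[v] = best
    return assignment
```

On a graph of two disjoint triangles, the heuristic keeps each triangle together, which is exactly the edge-cut reduction the paper is after.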
Strongly Recommended Readings:
Wednesday
Date | Topic | Presenter
--- | --- | ---
Apr 27 | Bioinformatics Applications | Myron Peto, OHSU
Assigned Readings:
Week 6
Monday
Date | Topic | Presenter
--- | --- | ---
May 2 | Graphs, part 2 | Ogi
Assigned Readings:
- Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. "The PageRank citation ranking: Bringing order to the web." (1999).
- Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 973-984. ACM, 2011.
- "Pregel: A System for Large-Scale Graph Processing", Grzegorz Malewicz et al., Proc. International Conference on Management of Data (SIGMOD), pp. 135-146, 2010.
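The computation that both the personalized-PageRank and Pregel papers distribute is, at its heart, power iteration. A minimal single-machine version (the graph and names are ours; for simplicity it assumes every node has at least one out-link):

```python
def pagerank(links, d=0.85, iters=50):
    """links: node -> list of out-neighbors (every node appears as a key).
    Returns node -> rank. Each iteration is what one MapReduce job (or one
    Pregel superstep) computes: every node sends rank/out-degree to its
    neighbors, and each node sums what it receives."""
    n = len(links)
    rank = {v: 1.0 / n for v in links}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in links}   # teleport term
        for v, outs in links.items():
            share = rank[v] / len(outs)
            for u in outs:
                new[u] += d * share
        rank = new
    return rank
```

The distributed versions differ mainly in where the "send shares, sum shares" step runs: in MapReduce it is a full map/shuffle/reduce per iteration, while Pregel keeps vertex state resident between supersteps.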
Wednesday
Date | Topic | Presenter
--- | --- | ---
May 4 | Pilot Study Presentations | Everybody!
Week 7
Monday
Date | Topic | Presenter
--- | --- | ---
May 9 | Distributed Math & ML | Soe
Assigned Readings:
- "Distributed Training Strategies for the Structured Perceptron", Ryan McDonald et al., Proc. NAACL, 2010.
- "Scalable k-means++." Bahmani, Bahman, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Proceedings of the VLDB Endowment 5, no. 7 (2012): 622-633.
- "Stochastic Gradient Boosted Distributed Decision Trees", Jerry Ye et al., Proc. ACM Conference on Information and Knowledge Management (CIKM), pp. 2061-2064, 2009.
- "Distributed matrix factorization with mapreduce using a series of broadcast-joins." Schelter, Sebastian, Christoph Boden, Martin Schenck, Alexander Alexandrov, and Volker Markl. In Proceedings of the 7th ACM conference on Recommender systems, pp. 281-284. ACM, 2013.
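The simplest strategy in the McDonald et al. paper is "parameter mixing": train a perceptron independently on each shard of the data, then average the resulting weight vectors. A toy two-feature binary version (the data and names are ours; the paper trains structured perceptrons, and also develops a better iterative variant):

```python
def train_perceptron(shard, epochs=10):
    # Standard perceptron on one data shard; labels y are in {-1, +1}.
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in shard:
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:   # misclassified
                w = [w[0] + y * x[0], w[1] + y * x[1]]
                b += y
    return w, b

def parameter_mix(shards):
    # "Map": train one model per shard. "Reduce": average the weights.
    models = [train_perceptron(s) for s in shards]
    w = [sum(m[0][i] for m in models) / len(models) for i in range(2)]
    b = sum(m[1] for m in models) / len(models)
    return w, b
```

This is embarrassingly parallel (one MapReduce job, no communication during training), which is exactly why the paper uses it as the baseline against which smarter iterative mixing is measured.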
Wednesday
Date | Topic | Presenter
--- | --- | ---
May 11 | Distributed Language Modeling | Brian Roark, Google
Assigned Readings:
TBD
Week 8
Monday
Date | Topic | Presenter
--- | --- | ---
May 16 | Collaborative Filtering & LDA | TBD
Also Happening Today:
- One more pilot presentation!
- Discussion of distributed matrix factorization article from 5/9
Assigned Readings:
Suggested:
Wednesday
Date | Topic | Presenter
--- | --- | ---
May 18 | File Systems & Databases | TBD
Assigned Readings:
- Bigtable: A Distributed Storage System for Structured Data, Fay Chang et al., Proc. Symposium on Operating Systems Design and Implementation (OSDI), pp. 205-218, 2006.
- Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31, no. 3 (2013): 8. (Joel)
Strongly Recommended:
- Lamport, Leslie. "Paxos made simple." ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58.
- The Google File System, Sanjay Ghemawat et al., ACM Symposium on Operating Systems Principles (SOSP), pp. 29-43, 2003.
Week 9
Monday
Date | Topic | Presenter
--- | --- | ---
May 23 | MPI | Rosemary
Assigned Readings:
Useful References:
Wednesday
Date | Topic | Presenter
--- | --- | ---
May 25 | Security | Casey
Assigned Readings:
Suggested Readings:
Week 10
Wednesday
Date | Topic | Presenter
--- | --- | ---
June 1 | Workflow with Luigi | Guest: Joel
Suggested Readings:
Logistics
Dates and Times
CS5/624 will be held Mondays and Wednesdays, from 4:00 to 5:30 PM, in GH5.
Technical Matters
For this class, we will be using the CSLU "Bigbird" cluster. Ethan Van Matre is the system administrator for this cluster, and will be setting up user accounts for all students; email both him and Steven if you have any problems with your account. Please include "CS624" in the subject line of your emails, if possible.
A subset of the bigbird machines has been set aside for use by this class and loaded with current versions of Spark and Hadoop. The "head node" for this sub-cluster is bigbird61.cslu.ohsu.edu; Hadoop (and other) jobs should be run from there. You can access the administrative console here.
To connect to the cluster from on-campus, simply ssh to {your_user_name}@bigbird61.cslu.ohsu.edu. From off-campus, you'll want to connect to our gateway machine by ssh-ing to {your_user_name}@cslu.ohsu.edu, and from there connecting to bigbird61. I strongly recommend setting up an RSA keypair and using it for SSH authentication; see below for links to some useful SSH-related resources.
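For off-campus use, an ~/.ssh/config stanza can automate the two-hop connection. A hypothetical example (substitute your own username; ProxyJump requires OpenSSH 7.3 or newer, so older clients will need a ProxyCommand instead):

```
# ~/.ssh/config -- example only; replace "yourname" with your username
Host cslu-gateway
    HostName cslu.ohsu.edu
    User yourname

Host bigbird61
    HostName bigbird61.cslu.ohsu.edu
    User yourname
    ProxyJump cslu-gateway     # hop through the gateway automatically
```

With this in place, `ssh bigbird61` works from off campus in one command, and your RSA keypair can authenticate both hops.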
Homework
We will have homework, (basically) all of which will involve programming. The point of the homework is to give you "hands on" experience with the algorithms and techniques we'll be covering, not to learn how to write production-ready code. For some of the assignments, I will provide "scaffolding" code that may save you significant time; mostly, this code will be written in either Python or Java. If you want to use something else to do the assignment, you are of course free to do so.
The homework assignments will all come with a "due date." Assignments will be due at 11:59 pm (Portland time) on their due date. If you think you will need additional time to complete an assignment, let me know as soon as possible. If something serious and unexpected comes up at the last minute (illness, family emergency, etc.), we'll work something out.
The deliverables for the final project are an in-class presentation and a short paper done in the style of a conference submission: a maximum of eight pages, not counting references. The writeup will be due on June 15. Unless otherwise agreed, this will be a hard deadline, as grades are due later that week.
Grading
Your grade will be based on three things: in-class participation (including paper presentations) (30%), homework (30%), and the final project (40%).
This will be a largely "seminar-style" course, and most sessions will involve student-led discussions of journal articles. Everybody should come to class having read the material, and be ready to discuss it as a group.
Resources
Useful books
Websites of note
We will be filling these in as we go along!
Tools
- Sidestep (useful for setting up an SSH proxy for off-campus access to the cluster)
Student Access Statement
Our program is committed to all students achieving their potential. If you have a disability, or think you may have a disability (physical, learning, hearing, vision, psychological), and may need a reasonable accommodation, please contact Student Access at (503) 494-0082 or e-mail studentaccess@ohsu.edu to discuss your needs. You can also find more information at www.ohsu.edu/student-access. Because accommodations can take time to implement, it is important to have this discussion as soon as possible. All information regarding a student's disability is kept in accordance with relevant state and federal laws.