Course CS506 Special Topics: Problem Solving with Large Clusters

Problem Solving with Large Clusters
Steven Bedrick, CSLU, OHSU

News and Assignments Course Information Lectures and Readings Homework Textbooks Schedule Links

News and Assignments

Links fixed for Monday, 4/7
The week of 4/14-4/18, we will meet on Wednesday and Friday, not Monday and Wednesday.
Updated the link to "Data Center as Computer" to the new(ish) 2^nd edition.
Homework 1 posted!
First meeting on 3/31.

Course Information

Many real-world computational problems involve data sets that are too large to process on a single computer, or that have other characteristics— fault-tolerance, etc.— that require multiple computers working together. Examples include analysis of high-throughput genomic or proteomic data, data analytics over very large data sets, large-scale social network analysis, training machine learning models on "web scale" data sets, and so forth.

In this course, we will explore a variety of approaches to solving these kinds of problems through a mixture of lectures and student-led discussions of the research literature in the field. We will also hear from several guest lecturers with practical experience applying these kinds of algorithms in both academia and industry.

In addition to reading and discussing articles, students will become familiar with the Hadoop map-reduce environment as well as several other such systems through class assignments. There will also be a final project on the subject of the student's choice.

Prerequisites: A graduate level course on machine learning or probability and statistics. Students should be comfortable coding in at least one programming language, and will find the course much easier if they are familiar with the UNIX command-line environment than if they are not.

Grading: Students will be graded as follows: 20% final project; 20% assignments; 60% participation (including paper presentations). This will largely be a "seminar-style" course, and most sessions will revolve around student-led discussions of papers. Everybody should come to class having read the papers and prepared to participate in the discussion.

Final Project:As noted, the final project will make up 20% of the course grade. The project will be on a subject of the student's choosing (in consultation with the instructor), and will require both a written report and a final presentation. If possible, the source code, etc. for the project should be publicly releasable. Student projects in previous iterations of this course have resulted in publications, so don't be afraid to think big!

Lectures and Readings

Note that this schedule is currently somewhat "loose", and will shift depending on guest lecturer availability.

Day	Date	Topic	Presenters
Monday	3/31	Course Overview, Infrastructure AT CSLU	Steve
Intro Slides The Google File System, Sanjay Ghemawat et. al., ACM Symposium on Operating Systems Principles (SOSP), pg. 29-43, 2003. Lustre File System, White Paper, Sun Microsystems. (Note: this is horrifically old, but the content is still generally accurate, at least in spirit.) Data Center As a Computer, 2^nd ed., Luiz Andre Barroso, Jimmy Clidaras, Urs Hölzle, Synthesis Lectures on Computer Architecture, 2013. [pdf] Lin-Dyer Textbook Chapter 1-3 MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, vol. 51, no. 1, pg. 107-113, 2008. MapReduce: A Flexible Data Processing Tool, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, vol 53, no. 1, pg. 72-77, 2010. MapReduce: The Programming Model and Practice, J. Zhao et. al, SIGMETRICS'09 Tutorial, 2009. MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004. Designs, Lessons and Advice from Building Large Distributed Systems, Jeff Dean, Keynote LADIS 2009. Relative latency of various operations (cache miss, disk access, etc.).
Wednesday	4/2	Inverted Indexing, MT Language Models, Multicore MapReduce	Joel, Shiran
Challenges in building large-scale information retrieval systems, Jeffrey Dean, Invited talk, Proc. of the Second ACM International Conference on Web Search and Data Mining, 2009. Lin & Dyer, Chapter 4: Inverted Indexing for Text Retrieval Large Language Models in Machine Translation, Thorsten Brants et. al., Proc. Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP), pg. 858-867, 2007. [Joel's Slides] Evaluating Map-Reduce for Multicore and Multiprocessor Systems, Proc. High Performance Computer Architecture (HPCA), Best Paper Award, 2007, Phoneix Code. [Shiran's Slides] For further reading: For more about inverted indexing (with or without MapReduce): Jimmy Lin, Donald Metzler, Tamer Elsayed, and Lidan Wang. Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search. Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009), November 2009, Gaithersburg, Maryland. An alternativec (and multicore!) approach to parallel inverted indexing: Zheng Wei; Jaja, J., "An Optimized High-Throughput Strategy for Constructing Inverted Files," Parallel and Distributed Systems, IEEE Transactions on , vol.23, no.11, pp.2033,2044, Nov. 2012 Lin, Jimmy. "The curse of zipf and limits to parallelization: a look at the stragglers problem in MapReduce." 7th Workshop on Large-Scale Distributed Systems for Information Retrieval. Vol. 1. 2009. Inverted Files for Text Search Engines, Justin Zobel et. al., ACM Computing Survey, vol. 38, no. 2, 2006. For interesting discussion of what you can do with web-scale language models, consult: Whitelaw, Casey, et al. "Using the web for language independent spellchecking and autocorrection." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2. Association for Computational Linguistics, 2009. Lin, Jimmy, Rion Snow, and William Morgan. "Smoothing techniques for adaptive online language models: topic tracking in tweet streams." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011. Huang, Jian, et al. "Exploring web scale language models for search query processing." Proceedings of the 19th international conference on World wide web. ACM, 2010. For more on multicore MapReduce: Optimizing MapReduce for Multicore Architectures, Yandong Mao et al, Technical Report, MIT-CSAIL-TR-2010-020, MIT. Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling, R. Chen et al., Proc International Conference on Parallel Architectures and Compilation Techniques, 2010. Where does the speedup go: Quantitative modeling of performance losses in shared-memory programs, Seon Wook Kim et al, Parallel Processing Letters, 2001.
Monday	4/7	More MT, Document Similarity, Topic Modeling	Alireza & Mahsa
Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce, Christopher Dyer et. al., Proc. ACL Workshop on Statistical Machine Translation, pg. 199-207, 2008. slides Pairwise Document Similarity in Large Collections with MapReduce, Tamer Elsayed et. al., Association for Computational Linguistics (ACL), pg. 265-268, 2008. Jimmy Lin, Michael DiCuccio, Vahan Grigoryan, W. John Wilbur, Navigating information spaces: A case study of related article search in PubMed, Information Processing & Management, Volume 44, Issue 5, September 2008, Pages 1771-1783
Wednesday	4/9	Cancelled!	N/A

Wednesday	4/16	Distributed approaches to linear algebra	Jesse & Joseph
"Elemental: A new framework for distributed memory dense matrix computations." Poulson, Jack, Bryan Marker, Robert A. van de Geijn, Jeff R. Hammond, and Nichols A. Romero. ACM Transactions on Mathematical Software (TOMS) 39, no. 2 (2013): 13. HAMA: An efficient matrix computation with the MapReduce framework, Sangwon et. al., Proc. IEEE Conference on Clound Computing Technology and Science, 2007. Additional reading (highly recommended!): Ensemble Nystrom Method, S. Kumar et. al., Proc. Neural Information Processing Systems (NIPS), 2010, Winner of Best Student Paper at the New York Academy of Sciences 2009 Symposium on ML. Parallel Spectral Clustering, Wen-Yen Chen et. al., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010. [code]
Friday	4/18	Machine Learning: perceptron training, parallel svm, boosted decision trees	Shiran, Mahsa, Archana
Distributed Training Strategies for the Structured Perceptron, Ryan McDonald et. al., Proc NAACL, 2010. [Mahsa's slides] Parallelizing SVM on Distributed Computers, Edward Y. Chang et. al., Proc. Neural Information Processing Systems (NIPS), vol. 20, 2007. code Stochastic Gradient Boosted Decision Trees, Jerry Le et. al., Proc. ACM Conference on Information and Knowledge Management (CIKM), no. 4, pg. 2061-2064, 2009. [Shrian's slides]
Monday	4/21	Collaborative Filtering, LDA, Mahout tutorial	Joseph, Golnar
Apache Mahout Tutorial Collaborative Filtering: Schelter, Sebastian, Christoph Boden, and Volker Markl. "Scalable similarity-based neighborhood methods with mapreduce." In Proceedings of the sixth ACM conference on Recommender systems, pp. 163-170. ACM, 2012. Schelter, Sebastian, Christoph Boden, Martin Schenck, Alexander Alexandrov, and Volker Markl. "Distributed matrix factorization with mapreduce using a series of broadcast-joins." In Proceedings of the 7th ACM conference on Recommender systems, pp. 281-284. ACM, 2013. LDA: [Golnar's Slides] Asynchronous Distributed Learning of Topic Models, Arthur Asuncion et. al., Proc. Neural Information Processing Systems (NIPS), pg. 81-88, 2008. Distributed Inference for Latent Dirichlet Allocation, David Newman et. al., Proc Neural Information Processing Systems (NIPS), vol. 20, pg. 1081-1088, 2007. Additional Reading: More on LDA: Tutorial on Latent Dirichlet Allocation, David Blei, Machine Learning Summer School, 2009 (with video, excellent!) More to come!
Wednesday	4/23	Graphs 1	Joel, Mahsa
Graph Twiddling in a MapReduce World, Jonathan Cohen, Computing in Science and Engineering, vol. 11, no. 4, pg. 29-41 (Presenter: Joel) Pregel: A System for Large-Scale Graph Processing, Grzegorz Malewicz et. al., Proc. International Conference on Management of Data (SIGMOD), no. 12, pg. 135-146, 2010. [Mahsa's slides] Additional background reading (strongly recommended): Link Mining: A Survey, Lise Getoor et. al., 7(2):3-12, SIGKDD Exploratory Newsletter, 2005. "Processing large-scale graph data: A guide to current technology.", Sherif Sakr, IBM developerWorks, June 2013.
Monday	4/28	Graphs 2	Shiran, Golnar, Jesse
Stanton, Isabelle, and Gabriel Kliot. "Streaming graph partitioning for large distributed graphs." In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222-1230. ACM, 2012. (Presenter: Shiran) Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 973-984. ACM, 2011. (Presenter: Golnar) Kollias, Giorgos, Madan Sathe, Olaf Schenk, and Ananth Grama. "Fast parallel algorithms for graph similarity and matching." Journal of Parallel and Distributed Computing (2014). (Presenter: Jesse) Additional reading: Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. "The PageRank citation ranking: Bringing order to the web." (1999).
Wednesday	4/30	Distributed database algorithms (and K-means++)	Archana & Joel
Bahmani, Bahman, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. "Scalable k-means++." Proceedings of the VLDB Endowment 5, no. 7 (2012): 622-633. (Presenter: Archana) Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31, no. 3 (2013): 8. (Joel) Additional Reading: Lamport, Leslie. "Paxos made simple." ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58. (highly recommended!) Thusoo, Ashish, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. "Hive-a petabyte scale data warehouse using hadoop." In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pp. 996-1005. IEEE, 2010.
Monday	5/5	Distributed data stores (BigTable, Hive, etc.)	Joseph, Golnar
BigTable: A Distributed storage system for structured data Fay Chang et. al., Proc. Symposium on Operating System Design and Implementation (OSDI), pages 205-218, 2006. (Joseph) Thusoo, Ashish, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2, no. 2 (2009): 1626-1629. (Golnar)
Friday	5/9	Guest Speaker: Izhak Shafran, PhD (Google)	TBA
Dean, Jeffrey, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao et al. "Large Scale Distributed Deep Networks." In NIPS 2012, pp. 1232-1240. Georg Heigold, Erik McDermott, Vincent Vanhoucke, Andrew Senior, Michiel Bacchiani. "Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks" in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2014. Additional Reading: Snoek, Jasper and Larochelle, Hugo and Adams, Ryan P. Practical Bayesian Optimization of Machine Learning Algorithms In NIPS 2012, pp. 2951--2959. Coates, Adam, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng Andrew. "Deep learning with COTS HPC systems." In Proceedings of The 30th International Conference on Machine Learning, pp. 1337-1345. 2013. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013. Ian J. Goodfellow; Yaroslav Bulatov; Julian Ibarz; Sacha Arnoud; Vinay Shet. "Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks" In Proceedings of Workshop at ICLR, 2014.
Monday	5/12	Distributed data analysis tools	Joel, Golnar, Jesse
Gates, Alan F., Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. "Building a high-level dataflow system on top of Map-Reduce: the Pig experience." Proceedings of the VLDB Endowment 2, no. 2 (2009): 1414-1425. (Joel) Melnik, Sergey, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. "Dremel: interactive analysis of web-scale datasets." Proceedings of the VLDB Endowment 3, no. 1-2 (2010): 330-339. (Golnar) Pike, Rob, Sean Dorward, Robert Griesemer, and Sean Quinlan. "Interpreting the data: Parallel analysis with Sawzall." Scientific Programming 13, no. 4 (2005): 277-298. (Jesse)
Wednesday	5/14	Distributed File Systems	Keith Mannthey, Intel Corp.

Wednesday	5/21	Genomics	Guest Speaker: Myron Peto, PhD (Spellman Lab, OHSU)
Heng Li and Richard Durbin. "Fast and accurate short read alignment with Burrows–Wheeler transform" Bioinformatics Vol. 25 no. 14, 2009, pages 1754–1760. Zhang, Jing, Heshan Lin, Pavan Balaji, and Wu-chun Feng. "Optimizing Burrows-Wheeler Transform-Based Sequence Alignment on Multicore Architectures." In Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, pp. 377-384. IEEE, 2013. Kristian Cibulskis, et al. "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples" Nature Biotechnology, volume 31 number 3, March 2013, pages 213-219. Mark DePristo, et al. "A framework for variation discovery and genotyping using next-generation DNA sequencing data" Nature Genetics volume 43 number 5, May 2011, pages 491-498.
Friday	5/23	Guest Speaker: Kyle Ambert, PhD (Intel Corp.) [slides]

Monday	5/26	No class- Memorial day!	TBA

Wednesday	5/28	Condor workflows, MPI	Shiran
Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005. Peter Couvares, Tevik Kosar, Alain Roy, Jeff Weber and Kent Wenger, "Workflow in Condor", in In Workflows for e-Science, Editors: I.Taylor, E.Deelman, D.Gannon, M.Shields, Springer Press, January 2007 Dan Bradley, Timothy St Clair, Matthew Farrellee, Ziliang Guo, Miron Livny, Igor Sfiligoi, and Todd Tannenbaum, "An update on the scalability limits of the Condor batch system", Journal of Physics: Conference Series, Vol. 331, No. 6, 2011. [Shiran's Slides] Additional Condor reading: Nicholas Coleman, "Distributed Policy Specification and Interpretation with Classified Advertisements", Practical Aspects of Declarative Languages, Lecture Notes in Computer Science Volume 7149, 2012, pp 198-211, January 2012.
Monday	6/2	Consensus algorithms & other alternative approaches (Steve out of town)	TBA
XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Nonparametric Decentralized Detection using Kernel Methods, IEEE Transactions on Signal Processing, pp 4053-4066, 2005; Outstanding ICML student paper award. (Jesse) Lamport, Leslie. "Paxos made simple." ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58. (yes, it was "recommended" before, but we're going to do it for real this time) (Mahsa)
Friday	6/6	Streaming/"Realtime" approaches	TBA
Mishne, Gilad, Jeff Dalton, Zhenghua Li, Aneesh Sharma, and Jimmy Lin. "Fast data in the era of big data: Twitter's real-time related query suggestion architecture." In Proceedings of the 2013 international conference on Management of data, pp. 1147-1158. ACM, 2013. (Archana) Spark: (Joseph) Zaharia, Matei, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. "Spark: cluster computing with working sets." In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, pp. 10-10. 2010. Zaharia, Matei, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. "Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters." In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing, pp. 10-10. USENIX Association, 2012. Ananthanarayanan, Rajagopal, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, and Shivakumar Venkataraman. "Photon: Fault-tolerant and scalable joining of continuous data streams." In Proceedings of the 2013 international conference on Management of data, pp. 577-588. ACM, 2013. (Joel) Additional reading: de Carvalho, Otávio M., Eduardo Roloff, and Philippe OA Navaux. "A Survey of the State-of-the-art in Event Processing." Dena, Dragos, Mihai Bucicoiu, and Mircea Bardac. "A managed distributed processing pipeline with Storm and Mesos." In Networking in Education and Research, 2013 RoEduNet International Conference 12th Edition, pp. 1-6. IEEE, 2013.
Monday	6/9	Xeon PHI, etc.	Michael Julier (Intel Corp.)

Wednesday	6/11	"Cloud" clusters— EC2, StarCluster, PiCloud; setup & administration	TBA

Monday	6/16	Project presentations!	TBA

Wednesday	6/18	Project presentations!	TBA

Homework

Assignment	Assigned	Due
Homework 1: Word counting, etc.	3/31/14	4/6/14

Textbook and Other Useful Resources

Note: For recently developed techniques, we will rely on selected papers, which will be provided in required readings.

Schedule

Meetings	Generally, M-W 1600-1730 (some weeks we may meet on Friday instead of Monday or Wednesday)
Venue	SoN 116
Office hours	By appointment (request by email)

Links

Relevant Software Tools & Other Resources
Note: This list, while useful, is out of date. We will update it throughout the course!