Many real-world computational problems involve data sets that are too large to process on a single computer, or that have other characteristics— fault-tolerance, etc.— that require multiple computers working together. Examples include analysis of high-throughput genomic or proteomic data, data analytics over very large data sets, large-scale social network analysis, training machine learning models on "web scale" data sets, and so forth.
In this course, we will explore a variety of approaches to solving these kinds of problems through a mixture of lectures and student-led discussions of the research literature in the field. We will also hear from several guest lecturers with practical experience applying these kinds of algorithms in both academia and industry.
In addition to reading and discussing articles, students will become familiar with the Hadoop map-reduce environment as well as several other such systems through class assignments. There will also be a final project on the subject of the student's choice.
Day |
Date |
Topic |
Presenters |
Monday |
3/31 |
Course Overview, Infrastructure AT CSLU |
Steve |
-
Intro Slides
-
The Google File System, Sanjay Ghemawat et. al., ACM Symposium on Operating Systems Principles (SOSP), pg. 29-43, 2003.
-
Lustre File System, White Paper, Sun Microsystems. (Note: this is horrifically old, but the content is still generally accurate, at least in spirit.)
-
Data Center As a Computer, 2nd ed., Luiz Andre Barroso, Jimmy Clidaras, Urs Hölzle, Synthesis Lectures on Computer Architecture, 2013. [pdf]
- Lin-Dyer Textbook Chapter 1-3
-
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, vol. 51, no. 1, pg. 107-113, 2008.
-
MapReduce: A Flexible Data Processing Tool, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, vol 53, no. 1, pg. 72-77, 2010.
-
MapReduce: The Programming Model and Practice, J. Zhao et. al, SIGMETRICS'09 Tutorial, 2009.
-
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004.
-
Designs, Lessons and Advice from Building Large Distributed Systems, Jeff Dean, Keynote LADIS 2009.
-
Relative latency of various operations (cache miss, disk access, etc.).
|
Wednesday |
4/2 |
Inverted Indexing, MT Language Models, Multicore MapReduce |
Joel, Shiran |
-
Challenges in building large-scale information retrieval systems, Jeffrey Dean, Invited talk, Proc. of the Second ACM International Conference on Web Search and Data Mining, 2009.
-
Lin & Dyer, Chapter 4: Inverted Indexing for Text Retrieval
-
Large Language Models in Machine
Translation, Thorsten Brants et. al., Proc. Joint Conference
on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning (EMNLP), pg. 858-867,
2007. [Joel's Slides]
-
Evaluating Map-Reduce for
Multicore and Multiprocessor Systems, Proc. High Performance
Computer Architecture (HPCA), Best Paper Award, 2007, Phoneix Code. [Shiran's Slides]
For further reading:
-
For more about inverted indexing (with or without MapReduce):
- Jimmy Lin, Donald Metzler, Tamer Elsayed, and Lidan Wang. Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search. Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009), November 2009, Gaithersburg, Maryland.
- An alternativec (and multicore!) approach to parallel inverted indexing: Zheng Wei; Jaja, J., "An Optimized High-Throughput Strategy for Constructing Inverted Files," Parallel and Distributed Systems, IEEE Transactions on , vol.23, no.11, pp.2033,2044, Nov. 2012
- Lin, Jimmy. "The curse of zipf and limits to parallelization: a look at the stragglers problem in MapReduce." 7th Workshop on Large-Scale Distributed Systems for Information Retrieval. Vol. 1. 2009.
- Inverted Files for Text Search Engines, Justin Zobel et. al., ACM Computing Survey, vol. 38, no. 2, 2006.
-
For interesting discussion of what you can do with web-scale language models, consult:
- Whitelaw, Casey, et al. "Using the web for language independent spellchecking and autocorrection." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2. Association for Computational Linguistics, 2009.
- Lin, Jimmy, Rion Snow, and William Morgan. "Smoothing techniques for adaptive online language models: topic tracking in tweet streams." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
- Huang, Jian, et al. "Exploring web scale language models for search query processing." Proceedings of the 19th international conference on World wide web. ACM, 2010.
-
For more on multicore MapReduce:
- Optimizing MapReduce for Multicore
Architectures, Yandong Mao et al, Technical Report,
MIT-CSAIL-TR-2010-020, MIT.
-
Tiled-MapReduce: optimizing resource
usages of data-parallel applications on multicore with tiling,
R. Chen et al., Proc International Conference on Parallel
Architectures and Compilation Techniques, 2010.
-
Where does the speedup go: Quantitative
modeling of performance losses in shared-memory programs, Seon
Wook Kim et al, Parallel Processing Letters, 2001.
|
Monday |
4/7 |
More MT, Document Similarity, Topic Modeling |
Alireza & Mahsa
|
- Fast, Easy, and Cheap: Construction of
Statistical Machine Translation Models with MapReduce,
Christopher Dyer et. al., Proc. ACL Workshop on Statistical
Machine Translation, pg. 199-207, 2008. slides
- Pairwise Document Similarity in Large
Collections with MapReduce, Tamer Elsayed et. al.,
Association for Computational Linguistics (ACL), pg. 265-268,
2008.
- Jimmy Lin, Michael DiCuccio, Vahan Grigoryan, W. John Wilbur, Navigating information spaces: A case study of related article search in PubMed, Information Processing & Management, Volume 44, Issue 5, September 2008, Pages 1771-1783
|
Wednesday |
4/9 |
Cancelled! |
N/A |
|
Wednesday |
4/16 |
Distributed approaches to linear algebra |
Jesse & Joseph |
Additional reading (highly recommended!):
-
Ensemble Nystrom Method, S. Kumar et. al., Proc. Neural Information Processing Systems (NIPS), 2010, Winner of Best Student Paper at the New York Academy of Sciences 2009 Symposium on ML.
-
Parallel Spectral Clustering, Wen-Yen Chen et. al., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010. [code]
|
Friday |
4/18 |
Machine Learning: perceptron training, parallel svm, boosted decision trees |
Shiran, Mahsa, Archana |
- Distributed Training Strategies for the Structured Perceptron, Ryan McDonald et. al., Proc NAACL, 2010. [Mahsa's slides]
- Parallelizing SVM on Distributed Computers, Edward Y. Chang et. al., Proc. Neural Information Processing Systems (NIPS), vol. 20, 2007. code
- Stochastic Gradient Boosted Decision Trees, Jerry Le et. al., Proc. ACM Conference on Information and Knowledge Management (CIKM), no. 4, pg. 2061-2064, 2009. [Shrian's slides]
|
Monday |
4/21 |
Collaborative Filtering, LDA, Mahout tutorial |
Joseph, Golnar |
-
Apache Mahout Tutorial
-
Collaborative Filtering:
- Schelter, Sebastian, Christoph Boden, and Volker Markl. "Scalable similarity-based neighborhood methods with mapreduce." In Proceedings of the sixth ACM conference on Recommender systems, pp. 163-170. ACM, 2012.
-
Schelter, Sebastian, Christoph Boden, Martin Schenck, Alexander Alexandrov, and Volker Markl. "Distributed matrix factorization with mapreduce using a series of broadcast-joins." In Proceedings of the 7th ACM conference on Recommender systems, pp. 281-284. ACM, 2013.
- LDA: [Golnar's Slides]
Additional Reading:
- More on LDA:
- More to come!
|
Wednesday |
4/23 |
Graphs 1 |
Joel, Mahsa |
- Graph Twiddling in a MapReduce World, Jonathan Cohen, Computing in Science and Engineering, vol. 11, no. 4, pg. 29-41 (Presenter: Joel)
- Pregel: A System for Large-Scale Graph Processing, Grzegorz Malewicz et. al., Proc. International Conference on Management of Data (SIGMOD), no. 12, pg. 135-146, 2010. [Mahsa's slides]
Additional background reading (strongly recommended):
|
Monday |
4/28 |
Graphs 2 |
Shiran, Golnar, Jesse |
- Stanton, Isabelle, and Gabriel Kliot. "Streaming graph partitioning for large distributed graphs." In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222-1230. ACM, 2012. (Presenter: Shiran)
- Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 973-984. ACM, 2011. (Presenter: Golnar)
- Kollias, Giorgos, Madan Sathe, Olaf Schenk, and Ananth Grama. "Fast parallel algorithms for graph similarity and matching." Journal of Parallel and Distributed Computing (2014). (Presenter: Jesse)
Additional reading:
|
Wednesday |
4/30 |
Distributed database algorithms (and K-means++) |
Archana & Joel |
- Bahmani, Bahman, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. "Scalable k-means++." Proceedings of the VLDB Endowment 5, no. 7 (2012): 622-633. (Presenter: Archana)
- Corbett, James C., Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat et al. "Spanner: Google’s globally distributed database." ACM Transactions on Computer Systems (TOCS) 31, no. 3 (2013): 8. (Joel)
Additional Reading:
- Lamport, Leslie. "Paxos made simple." ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58. (highly recommended!)
- Thusoo, Ashish, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. "Hive-a petabyte scale data warehouse using hadoop." In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pp. 996-1005. IEEE, 2010.
|
Monday |
5/5 |
Distributed data stores (BigTable, Hive, etc.) |
Joseph, Golnar |
- BigTable: A Distributed storage system for structured data Fay Chang et. al., Proc. Symposium on Operating System Design and Implementation (OSDI), pages 205-218, 2006. (Joseph)
- Thusoo, Ashish, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2, no. 2 (2009): 1626-1629. (Golnar)
|
Friday |
5/9 |
Guest Speaker: Izhak Shafran, PhD (Google) |
TBA |
- Dean, Jeffrey, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao et al. "Large Scale Distributed Deep Networks." In NIPS 2012, pp. 1232-1240.
- Georg Heigold, Erik McDermott, Vincent Vanhoucke, Andrew Senior, Michiel Bacchiani. "Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks" in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2014.
Additional Reading:
- Snoek, Jasper and Larochelle, Hugo and Adams, Ryan P. Practical Bayesian Optimization of Machine Learning Algorithms In NIPS 2012, pp. 2951--2959.
- Coates, Adam, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Ng Andrew. "Deep learning with COTS HPC systems." In Proceedings of The 30th International Conference on Machine Learning, pp. 1337-1345. 2013.
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
- Ian J. Goodfellow; Yaroslav Bulatov; Julian Ibarz; Sacha Arnoud; Vinay Shet. "Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks" In Proceedings of Workshop at ICLR, 2014.
|
Monday |
5/12 |
Distributed data analysis tools |
Joel, Golnar, Jesse |
- Gates, Alan F., Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava. "Building a high-level dataflow system on top of Map-Reduce: the Pig experience." Proceedings of the VLDB Endowment 2, no. 2 (2009): 1414-1425. (Joel)
- Melnik, Sergey, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. "Dremel: interactive analysis of web-scale datasets." Proceedings of the VLDB Endowment 3, no. 1-2 (2010): 330-339. (Golnar)
- Pike, Rob, Sean Dorward, Robert Griesemer, and Sean Quinlan. "Interpreting the data: Parallel analysis with Sawzall." Scientific Programming 13, no. 4 (2005): 277-298. (Jesse)
|
Wednesday |
5/14 |
Distributed File Systems |
Keith Mannthey, Intel Corp. |
|
Wednesday |
5/21 |
Genomics |
Guest Speaker: Myron Peto, PhD (Spellman Lab, OHSU) |
- Heng Li and Richard Durbin. "Fast and accurate short read alignment with Burrows–Wheeler transform" Bioinformatics Vol. 25 no. 14, 2009, pages 1754–1760.
- Zhang, Jing, Heshan Lin, Pavan Balaji, and Wu-chun Feng. "Optimizing Burrows-Wheeler Transform-Based Sequence Alignment on Multicore Architectures." In Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, pp. 377-384. IEEE, 2013.
- Kristian Cibulskis, et al. "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples" Nature Biotechnology, volume 31 number 3, March 2013, pages 213-219.
- Mark DePristo, et al. "A framework for variation discovery and genotyping using next-generation DNA sequencing data" Nature Genetics volume 43 number 5, May 2011, pages 491-498.
|
Friday |
5/23 |
Guest Speaker: Kyle Ambert, PhD (Intel Corp.) [slides] |
|
Monday |
5/26 |
No class- Memorial day! |
TBA |
|
Wednesday |
5/28 |
Condor workflows, MPI |
Shiran |
- Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience" Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pages 323-356, February-April, 2005.
- Peter Couvares, Tevik Kosar, Alain Roy, Jeff Weber and Kent Wenger, "Workflow in Condor", in In Workflows for e-Science, Editors: I.Taylor, E.Deelman, D.Gannon, M.Shields, Springer Press, January 2007
- Dan Bradley, Timothy St Clair, Matthew Farrellee, Ziliang Guo, Miron Livny, Igor Sfiligoi, and Todd Tannenbaum, "An update on the scalability limits of the Condor batch system", Journal of Physics: Conference Series, Vol. 331, No. 6, 2011.
- [Shiran's Slides]
Additional Condor reading:
|
Monday |
6/2 |
Consensus algorithms & other alternative approaches (Steve out of town) |
TBA |
- XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Nonparametric Decentralized Detection using Kernel Methods, IEEE Transactions on Signal Processing, pp 4053-4066, 2005; Outstanding ICML student paper award. (Jesse)
- Lamport, Leslie. "Paxos made simple." ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58. (yes, it was "recommended" before, but we're going to do it for real this time) (Mahsa)
|
Friday |
6/6 |
Streaming/"Realtime" approaches |
TBA |
- Mishne, Gilad, Jeff Dalton, Zhenghua Li, Aneesh Sharma, and Jimmy Lin. "Fast data in the era of big data: Twitter's real-time related query suggestion architecture." In Proceedings of the 2013 international conference on Management of data, pp. 1147-1158. ACM, 2013. (Archana)
- Spark: (Joseph)
- Zaharia, Matei, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. "Spark: cluster computing with working sets." In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, pp. 10-10. 2010.
- Zaharia, Matei, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. "Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters." In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing, pp. 10-10. USENIX Association, 2012.
- Ananthanarayanan, Rajagopal, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, and Shivakumar Venkataraman. "Photon: Fault-tolerant and scalable joining of continuous data streams." In Proceedings of the 2013 international conference on Management of data, pp. 577-588. ACM, 2013. (Joel)
Additional reading:
|
Monday |
6/9 |
Xeon PHI, etc. |
Michael Julier (Intel Corp.) |
|
Wednesday |
6/11 |
"Cloud" clusters— EC2, StarCluster, PiCloud; setup & administration |
TBA |
|
Monday |
6/16 |
Project presentations! |
TBA |
|
Wednesday |
6/18 |
Project presentations! |
TBA |
|