Map/Reduce; Spark

Monday: Applications of Map/Reduce

Lin & Dyer, Chapters 1-4 (pretty quick reading)
Fast, Easy, and Cheap: Construction of Statistical Machine Translation Models with MapReduce, Christopher Dyer et al., Proc. ACL Workshop on Statistical Machine Translation, pp. 199-207, 2008.

Challenges in building large-scale information retrieval systems, Jeffrey Dean, Invited talk, Proc. of the Second ACM International Conference on Web Search and Data Mining, 2009.
Large Language Models in Machine Translation. Thorsten Brants et al., Proc. Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP), pp. 858-867, 2007.

Hadoop 2.7 documentation
Java API Documentation
Hadoop 2.7 Streaming Mode
Steve’s Minimal Java API Example
This is a minimal example and Maven project for getting up and running with Hadoop’s Java Map/Reduce API

“Apache Spark: a unified engine for big data processing”, Matei Zaharia et al. Communications of the ACM, 59(11), pp. 56-65. 2016
“Spark: Cluster Computing with Working Sets”, Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.
“Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters.” Zaharia, Matei, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing, pp. 10-10. USENIX Association, 2012.
“Spark SQL: Relational Data Processing in Spark”, Michael Armbrust et al. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1383-1394. ACM, 2015

“Scaling spark in the real world: performance and usability”, Michael Armbrust et al. In Proceedings of the VLDB Endowment - Proceedings of the 41st International Conference on Very Large Data Bases, pp. 1840-1843. 2015

Hortonworks Introduction to Spark
Apache Spark Documentation
- pyspark API Docs
Setting up a Spark Development Environment with Scala
- Parts of this tutorial will need to be adjusted, since a few things have changed since it was written, but it ought to be enough to get pointed in the right direction.
- In particular, adjust any of the specific details of getting a compiled jar file to run on the actual cluster: the tutorial assumes that one is using Cloudera’s Hadoop distribution on their cloud infrastructure, which we are not. The basic steps will be similar, but details like hostnames, ports, etc. will all be different.

On the CSLU cluster, before running any Spark-related command, make sure to run “module load hadoop”.
On the various Spark commands (spark-submit, pyspark, etc.), make sure to set the “--master” option to “yarn”, and to set the “--num-executors” option to some appropriate number. Otherwise, your job will run locally, and with only one worker process, which defeats the entire point!
You can access your job’s management URL from the main YARN console
Notes on using Jupyter on the bigbirds
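
As a rough illustration of the cluster notes above, here is a minimal Python sketch that assembles a spark-submit command line with the “--master” and “--num-executors” options set. The helper name build_spark_submit, the script name wordcount.py, the argument input.txt, and the executor count of 8 are all hypothetical placeholders, not part of the course setup; pick values that suit your own job.

```python
import shlex


def build_spark_submit(script, num_executors=4, extra_args=()):
    """Return the argument list for a spark-submit run on YARN.

    `script` and `extra_args` are placeholders for your own job;
    the default of 4 executors is an arbitrary assumption.
    """
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")
    return [
        "spark-submit",
        "--master", "yarn",                      # run on the cluster, not locally
        "--num-executors", str(num_executors),   # more than one worker process
        script,
        *extra_args,
    ]


# Hypothetical job: wordcount.py reading input.txt with 8 executors.
cmd = build_spark_submit("wordcount.py", num_executors=8, extra_args=["input.txt"])

# Print the shell line to run; "module load hadoop" comes first on the CSLU cluster.
print("module load hadoop && " + shlex.join(cmd))
```

The printed line can be pasted into a shell on the cluster. The same two options apply when starting an interactive pyspark session.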