Large Language Models in Machine Translation. Thorsten Brants et al., Proc. Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858-867, 2007.
Parts of this tutorial will need to be adjusted, since some details have changed since it was written, but it should be enough to point you in the right direction.
In particular, the specifics of getting a compiled jar file to run on the actual cluster will differ: the tutorial assumes Cloudera’s Hadoop distribution running on Cloudera’s cloud infrastructure, which we are not using. The basic steps will be similar, but details such as hostnames and ports will all be different.
Bigbird Cluster Notes
On the CSLU cluster, before running any Spark-related command, make sure to run “module load hadoop”.
With the various Spark commands (spark-submit, pyspark, etc.), make sure to set the “--master” option to “yarn” and the “--num-executors” option to an appropriate number. Otherwise, your job will run locally, with only one worker process, which defeats the entire point!
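As a rough sketch of the two notes above, a submission could be assembled as follows. The script name my_job.py and the executor count of 4 are hypothetical placeholders, not values taken from the cluster's documentation; pick a count that suits your job. The snippet only prints the command so that it can be inspected before it is run.

```shell
# Hypothetical placeholders: replace with your own script and executor count.
APP="my_job.py"
NUM_EXECUTORS=4

# Assemble the submit command. "--master yarn" sends the job to the cluster
# rather than running it locally; "--num-executors" sets how many executors
# are requested.
SUBMIT_CMD="spark-submit --master yarn --num-executors ${NUM_EXECUTORS} ${APP}"

# Print the command for inspection. On the cluster, run "module load hadoop"
# first, then execute the printed line.
echo "${SUBMIT_CMD}"
```

The same two options apply when starting an interactive pyspark session instead of submitting a script.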