CS5/624: Problem Solving with Large Clusters

Homework 2

Handed out: April 20, 2016
Due: 23:59 on May 4, 2016

Description

The purpose of this homework is to give you practice working with analytical tools such as Spark and Pig, and answering analytical questions about the Reddit data set. In /data/reddit you will find bzipped collections of all Reddit comments for 2015 (except for the month of July). Each comment is on its own line, and is represented as a JSON object. A representative example can be seen here.

For this homework, you may choose to do all of it in Spark, all of it in Pig, or mix and match as seems appropriate. You may make plots using whatever plotting tool you wish to use.

Part 1: Activity Counts

Begin by creating a week-by-week activity plot, showing how many comments (in total, across all subreddits) were made in each week of the year, with weeks on the x-axis and comment counts on the y-axis. Make sure to follow good graphing principles (label your axes, etc.), and feel free to experiment with including additional relevant data. For example, you might shade the plot background to indicate the months of the year, or something along those lines. Have fun, and be creative!
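
To give you a sense of what I'm after, here is a minimal matplotlib sketch; the weekly counts in it are placeholders, and the shaded months are just one example of the kind of extra context you might add.

    import datetime
    import matplotlib.pyplot as plt

    # Placeholder weekly counts; substitute the counts you compute in Spark or Pig.
    weeks = list(range(1, 53))
    counts = [1000000 + 5000 * w for w in weeks]

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(weeks, counts)

    # Lightly shade every other month so month boundaries are easy to pick out.
    for month in range(1, 12, 2):
        start = datetime.date(2015, month, 1).isocalendar()[1]
        end = datetime.date(2015, month + 1, 1).isocalendar()[1]
        ax.axvspan(start, end, alpha=0.1)

    ax.set_xlabel("ISO week of 2015")
    ax.set_ylabel("Number of comments")
    ax.set_title("Reddit comments per week, 2015")
    fig.tight_layout()
    fig.savefig("weekly_activity.png")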

Create two plots: one for all of Reddit, and another comparing counts for the New York City, Seattle, and Portland subreddits (look for "nyc", "seattle", and "portland" in the subreddit field, respectively).

Tip: to find out when a comment was made, look for its created_utc field. This is in "epoch seconds" format, and can be turned into a Python datetime object using the datetime.datetime.fromtimestamp() function.
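
To give you an idea of what the counting step might look like, here is a rough PySpark sketch; the input path glob and SparkContext setup are assumptions, so adapt them to however you run jobs on the cluster.

    import datetime
    import json

    from pyspark import SparkContext

    sc = SparkContext(appName="weekly-activity")

    def iso_week(comment):
        """Return the (ISO year, ISO week) in which a comment was posted."""
        ts = int(comment["created_utc"])
        return datetime.datetime.fromtimestamp(ts).isocalendar()[:2]

    # Spark decompresses .bz2 input files transparently.
    comments = sc.textFile("/data/reddit/*.bz2").map(json.loads)

    weekly_counts = (comments
                     .map(lambda c: (iso_week(c), 1))
                     .reduceByKey(lambda a, b: a + b)
                     .sortByKey()
                     .collect())

    for (year, week), count in weekly_counts:
        print("{} {} {}".format(year, week, count))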

Tip: NYC, Seattle, and Portland might have very different comment volumes. Try plotting both normalized and un-normalized comment counts. Do you see a difference?
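
One reasonable way to normalize is to divide each subreddit's weekly counts by that subreddit's yearly total, so the plot shows fractions of activity rather than raw volume. A tiny sketch (with a placeholder weekly_counts dict) might look like this:

    def normalize(counts):
        """Convert a list of weekly counts into fractions of the total."""
        total = float(sum(counts))
        return [count / total for count in counts]

    # weekly_counts here is a placeholder; use the per-subreddit weekly counts
    # you computed above.
    weekly_counts = {"nyc": [120, 150, 90], "portland": [20, 35, 15]}
    normalized = {sub: normalize(counts) for sub, counts in weekly_counts.items()}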

To complete this part of the assignment, provide me with the plots, your code, and a short explanation of how you produced them.

Part 2: Drilling Down

For the city subreddits, there were some weeks that were busier than others. How was the busiest week different from all other weeks? There are a lot of possible ways to look into that, but for this assignment, let's start by finding out what words occurred more frequently during the busiest week of the year relative to the rest of the year for the Portland, Seattle, and New York subreddits.

For each of those three cities, find the busiest week, and come up with the ten words with the highest Log-Likelihood Ratio (LLR) score (comparing the busiest week against that subreddit's comments from the rest of the year). You'll want to tokenize each comment (I suggest using NLTK's "Casual" tokenizer), remove stop words, and do some simple normalization. I recommend removing URLs, for starters; you may encounter other categories of token that need to be removed.
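
For reference, the tokenization step might look something like the sketch below; it assumes you've already downloaded NLTK's stopwords corpus, and the exact filters are up to you.

    import re

    from nltk.corpus import stopwords
    from nltk.tokenize.casual import TweetTokenizer

    # Requires nltk.download("stopwords") to have been run once.
    tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
    stop_words = set(stopwords.words("english"))
    url_re = re.compile(r"^https?://", re.IGNORECASE)

    def tokenize(body):
        """Tokenize a comment body, dropping stop words, URLs, and bare punctuation."""
        tokens = tokenizer.tokenize(body)
        return [t for t in tokens
                if t not in stop_words
                and not url_re.match(t)
                and any(ch.isalnum() for ch in t)]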

To compute the LLR scores, use the simple methodology outlined in this workshop paper by Rayson and Garside. Note that this is not the ideal methodology for identifying significant differences in word usage between corpora! It makes statistical assumptions that are very far from valid. However, it should be adequate for the purposes of this assignment. If you're interested, I recommend reading Robert Moore's paper from EMNLP 2004, "On Log-Likelihood-Ratios and the Significance of Rare Events", as well as "Significance testing of word frequencies in corpora" by Jefrey Lijffijt et al., in Digital Scholarship in the Humanities, December 2014.
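
For concreteness, the core of the calculation boils down to something like the function below, where a and b are a word's counts in the two corpora and c and d are the corpora's total token counts; treat this as a sketch, and check it against the paper rather than taking it on faith.

    import math

    def log_likelihood(a, b, c, d):
        """Rayson & Garside-style log-likelihood for one word.

        a, b: the word's counts in corpus 1 and corpus 2
        c, d: total token counts of corpus 1 and corpus 2
        """
        e1 = c * (a + b) / float(c + d)  # expected count in corpus 1
        e2 = d * (a + b) / float(c + d)  # expected count in corpus 2
        ll = 0.0
        if a > 0:
            ll += a * math.log(a / e1)
        if b > 0:
            ll += b * math.log(b / e2)
        return 2.0 * ll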

To complete this part of the assignment, give me each city's ten words (along with their LLR scores), your code, and a few paragraphs describing your approach, discussing your results, and talking about what sorts of adventures you had along the way: what went wrong, what went right, what you would do differently next time, etc.

Part 3: Spotting changes in activity

Some subreddits have relatively consistent levels of activity, whereas others change popularity quickly. For each month for which we have data, starting in February, produce a list of the ten subreddits with the greatest month-over-month relative increase in activity, as well as the ten with the greatest relative decrease in activity (since we're missing July, compare August to June). You should apply some sort of threshold, and only consider subreddits with a certain baseline amount of activity; otherwise, going from one to two comments per month would represent a massive relative change. I suggest starting your cutoff at the entirely arbitrary value of 250 (i.e., only look at subreddits with at least 250 comments during 2015).
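
As a rough starting point, the bookkeeping might look something like the sketch below; monthly_counts and yearly_counts are placeholder names for whatever per-subreddit counts your Spark or Pig job produces.

    # Placeholder inputs; substitute the per-subreddit counts from your own job.
    monthly_counts = {"portland": {1: 300, 2: 450}, "seattle": {1: 500, 2: 400}}
    yearly_counts = {"portland": 4000, "seattle": 6000}

    # Month pairs to compare; July is missing, so August is compared to June.
    MONTH_PAIRS = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
                   (6, 8), (8, 9), (9, 10), (10, 11), (11, 12)]
    CUTOFF = 250  # minimum 2015 comments for a subreddit to be considered

    def relative_changes(prev_month, month):
        changes = []
        for sub, counts in monthly_counts.items():
            if yearly_counts.get(sub, 0) < CUTOFF:
                continue
            before, after = counts.get(prev_month, 0), counts.get(month, 0)
            if before == 0:
                continue  # decide for yourself how to handle brand-new subreddits
            changes.append((sub, (after - before) / float(before)))
        return changes

    for prev_month, month in MONTH_PAIRS:
        changes = sorted(relative_changes(prev_month, month), key=lambda pair: pair[1])
        biggest_drops, biggest_gains = changes[:10], list(reversed(changes[-10:]))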

For extra credit, experiment with different values of that cutoff, and report back how things change. At what cutoff value do the results become stable?

To complete this part of the assignment, give me a table for each month, showing the ten subreddits in each category. Please also provide your code, as well as a short writeup describing your approach, discussing your findings, and reporting what you found particularly interesting, surprising, etc. about your results and the assignment.