CS 5/662, Winter 2023

HW1: Getting our feet wet

Most NLP tasks begin with the same overall pattern:

  1. Get data from somewhere
  2. Figure out how to read/parse it, and find the part we’re interested in
  3. Basic pre-processing (tokenization, stopword removal, unicode normalization, etc.)
  4. Counting stuff

In this assignment, you will be working with some of the basic file formats and Python libraries we will be using this term. You will be reading data from a standard corpus of newswire text, and computing a simple measure of word association to extract within-sentence collocations.

This is not intended to be a difficult assignment, but is meant to help you get used to the sorts of things we’ll be doing more of later in the term, and to give you practice with the core data-wrangling skills needed for effective NLP. Depending on the nature of your programming and command-line background, there may be some new things for you to learn as part of this assignment, so give yourself plenty of time and please do not hesitate to ask me for help if you get stuck!

A note on turning in code: Please set up a GitHub repository to store your code, and provide me with a link to that repository in Sakai. OHSU has an officially-supported, on-premises instance available for student use, located at https://source.ohsu.edu.

Part 0: Getting Set Up

You’ll need to have a working Python 3 installation, and the following libraries installed:

  • NLTK (used throughout this assignment)
  • A plotting library for the rank-frequency plot in Part 3 (matplotlib is one common choice)

Each of these should be installable via either pip or Anaconda, depending on your preference.
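
For example, assuming you go the pip route (matplotlib here is only one plotting option, not a requirement):

pip install nltk matplotlib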

Once you’ve installed NLTK, you’ll need to download some language resources. NLTK can download a variety of resources, such as text corpora and pre-trained models, using its downloader module; install the stopwords and punkt packages:

python -m nltk.downloader stopwords punkt

If you have difficulty with these steps, ask for help!

Part 1: Reading some data

This directory contains news documents from the Central News Agency (Taiwan), drawn from the 5th edition of the Gigaword corpus (LDC2011T07). Each file is a compressed XML file containing multiple documents (roughly corresponding to news articles). Each document is wrapped in a <DOC> tag, which also carries a type attribute. Immediately below <DOC> in the hierarchy, the story text is separated from the headline and dateline by the <TEXT> tag. Finally, paragraphs of story text are marked by <P> tags.

The first part of the assignment is to download, decompress, and deserialize these files, extracting the text of all paragraphs (<P>) that are part of the <TEXT> of every <DOC> of type="story". Your script should operate on a list of gzipped XML files given as command-line arguments, and send its output to STDOUT. An example invocation should look something like:

python your_deserialization_script.py cna_eng/*.xml.gz > deserialized.txt

Note: You may not use regular expressions for XML parsing and deserialization. There is a good theoretical reason for this: XML is not a regular language. You must use a proper XML parser, and will lose points on the assignment if you do not.

Note: Looking at the contents of the XML files, you’ll note that the text has been “hard-wrapped” (i.e., sentences are split across multiple lines). As part of this script’s operation, undo this process so that each paragraph is on its own line.

What to turn in for this part:

  1. Your program
  2. Sample terminal output (say, the first 100 lines your script produces)
  3. A sentence or two describing your approach and any bugs you encountered.

Tips:
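
One possible approach, as a minimal sketch using only the standard library (gzip plus xml.etree.ElementTree). Gigaword-style files typically contain many <DOC> elements at the top level with no single root, so the sketch wraps the file contents in a dummy root before parsing; if strict parsing chokes on stray entities in the raw data, a more lenient parser (e.g., lxml with recovery enabled) is an alternative:

    import gzip
    import sys
    import xml.etree.ElementTree as ET

    for path in sys.argv[1:]:
        # Adjust the encoding if needed for your copy of the data.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            # Wrap the concatenated <DOC> elements in a dummy root so the
            # whole file parses as a single XML document.
            root = ET.fromstring("<root>" + f.read() + "</root>")
        for doc in root.iter("DOC"):
            if doc.get("type") != "story":
                continue
            text = doc.find("TEXT")
            if text is None:
                continue
            for p in text.iter("P"):
                if p.text:
                    # Undo hard-wrapping: collapse the wrapped lines of the
                    # paragraph onto a single output line.
                    print(" ".join(p.text.split()))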

Part 2: Structuring the data

The ultimate calculations we want to perform for this assignment are about words: how many there are, their frequencies in the corpus, and how they are distributed within sentences. To get the data in shape for this, we will need to do some restructuring.

  1. Looking at the output from Part 1, you’ll notice that it is a bit disjointed: each line is a paragraph, but we want each line to be a sentence (so we can compute within-sentence statistics). Use the NLTK tokenize module’s sent_tokenize function to split the output from Part 1 into discrete sentences.

    By default, this uses the Punkt sentence tokenization algorithm, which is reasonably effective but far from state-of-the-art. Note that if your output still has sentences spanning multiple lines, you will need to do some cleanup before you can run sent_tokenize!

  2. Once you’ve reconstituted the data into a one-line-per-sentence format, you’re ready to perform word-level tokenization. NLTK’s tokenize package includes a word_tokenize function; use it to process each sentence.

  3. Newswire text has a lot of punctuation, which for this assignment we don’t care about. Remove any tokens that consist solely of punctuation. Finally, since for this assignment we also do not care about capitalization, go ahead and upper-case everything.

Your final output from this section should be a file where each line consists of a single sentence, the line’s contents have been tokenized using word_tokenize, tokens consisting solely of punctuation have been stripped, and the text has been upper-cased.
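
A minimal sketch of this pipeline, assuming the Part 1 output arrives on STDIN (adapt the I/O to your own setup):

    import string
    import sys

    from nltk.tokenize import sent_tokenize, word_tokenize

    punct_set = set(string.punctuation)

    for paragraph in sys.stdin:
        for sentence in sent_tokenize(paragraph.strip()):
            tokens = word_tokenize(sentence)
            # Drop tokens made up entirely of punctuation (this also catches
            # multi-character tokens such as `` and --), then upper-case.
            tokens = [t.upper() for t in tokens
                      if not all(ch in punct_set for ch in t)]
            if tokens:
                print(" ".join(tokens))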

What to turn in:

  1. How many sentences are there in the CNA-GW corpus?
  2. Take a few minutes to look at sent_tokenize’s output. Can you spot any places where the Punkt algorithm made a mistake? (It may be easier to answer this question before you have performed steps 2 and 3 above.)

Tips:

    import string
    punct_set = set(string.punctuation)
    ...
    no_punct_toks = [t for t in tokens if t not in punct_set]
    

Part 3: Counting and comparing

Now that you’ve got your corpus prepared, you can start doing some analysis. Collocations are pairs of words that frequently occur together, such as “New York”. For this part of the assignment, treat a “word” as a token that comes out of your script from Part 2.

Word counting & distribution

Compute unigram and bigram frequency counts over the corpus. For this assignment, do not worry about padding your bigrams with start/end-of-sentence markers or anything like that. Perform the following analyses (a small counting sketch appears after these questions):

  1. How many unique unigram types are present in this corpus? How many unique bigram types?
  2. How many unigram tokens are there?
  3. Produce a rank-frequency plot (similar to those seen on the Wikipedia page for Zipf’s Law) for the unigrams in this corpus.

  4. What are the thirty most common unigrams?

  5. Look a little further down the list; do you begin to see entries that do not look like what you might consider a “word”? How might you adjust your processing pipeline from Part 2 to correct for this?

  6. You may notice that the most common entries are words that occur very frequently in English in general (stopwords). What happens to your type/token counts if you remove stopwords using nltk.corpus’s stopwords list?

  7. After removing stopwords, what are the thirty most common words?
    • Stop and reflect: “What to count as stopwords” is an important design choice that is part of any NLP project. Look at the words contained in NLTK’s stopwords list. Does this list make sense for this corpus and for this analysis? Are there entries that surprise you? Are there other words you would add?
    • What might some important considerations be when generating a stopwords list?
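
A minimal counting sketch using collections.Counter (the filename tokenized.txt is a placeholder for your Part 2 output, and the matplotlib portion assumes you have that library installed; it is only one way to draw the plot):

    from collections import Counter

    unigram_counts = Counter()
    bigram_counts = Counter()

    with open("tokenized.txt", encoding="utf-8") as f:  # placeholder filename
        for line in f:
            tokens = line.split()
            unigram_counts.update(tokens)
            # No start/end-of-sentence padding, per the assignment.
            bigram_counts.update(zip(tokens, tokens[1:]))

    print(len(unigram_counts), "unigram types;", len(bigram_counts), "bigram types")
    print(sum(unigram_counts.values()), "unigram tokens")
    print(unigram_counts.most_common(30))

    # For the stopword questions, nltk.corpus.stopwords.words("english")
    # provides NLTK's English stopword list.

    # Rank-frequency plot on log-log axes.
    import matplotlib.pyplot as plt
    freqs = sorted(unigram_counts.values(), reverse=True)
    plt.loglog(range(1, len(freqs) + 1), freqs)
    plt.xlabel("Rank")
    plt.ylabel("Frequency")
    plt.savefig("rank_frequency.png")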

What to turn in: Your answers to the questions above, including your rank-frequency plot.

Word association metrics

There are many ways to identify collocated words, but one common one is to use Pointwise Mutual Information (discussed in detail in Church & Hanks, 1990, though note that they are actually describing something slightly different, which they term the association ratio). This measure captures how much more likely it is that two events occur together than would be the case if the events were statistically independent.

\[PMI(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)P(w_2)}\]

For this part of the assignment, convert your unigram and bigram frequency counts to unigram and bigram probabilities (ignore issues of smoothing your probability estimates, for now). We will roughly follow Church & Hanks’s method for computing the probabilities, as described on page 23 of the linked paper (from the paragraph starting “In our application…”). For this assignment, we will consider our “window size” to be 1: in other words, we count literal bigrams rather than words sharing a larger context window, and normalize by the size of the corpus:

\[P(x,y) \approx \frac{c(x,y)}{N}\]
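
A minimal sketch of this computation, reusing the unigram_counts and bigram_counts Counters from the counting sketch above (the threshold value anticipates the questions below and is only illustrative):

    import math

    N = sum(unigram_counts.values())  # corpus size in tokens

    def pmi(w1, w2):
        # PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) ),
        # with P(x, y) ~ c(x, y) / N and P(x) ~ c(x) / N.
        p_xy = bigram_counts[(w1, w2)] / N
        p_x = unigram_counts[w1] / N
        p_y = unigram_counts[w2] / N
        return math.log2(p_xy / (p_x * p_y))

    threshold = 100  # only score bigrams seen at least this many times
    scored = [(pmi(w1, w2), w1, w2)
              for (w1, w2), count in bigram_counts.items()
              if count >= threshold]
    for score, w1, w2 in sorted(scored, reverse=True)[:10]:
        print(w1, w2, round(score, 2))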

Deliverables:

  1. Recalling Emily Bender’s sage advice (“Look at your data!”), examine the 30 highest-PMI word pairs computed using this method, along with their unigram and bigram frequencies. What do you notice?
  2. One drawback of using PMI in this way is that it is unstable when word frequencies are low. There are a variety of ways to solve this problem; one common way is to simply set a threshold, and only consider bigrams that occur with frequency above that threshold. Experiment with a few different threshold values, and report on what you observe.
  3. With a threshold of 100, what are the 10 highest-PMI word pairs?
  4. Examine the PMI for “New York”. Explain in your own words why it is not higher.

What to turn in:

Your answers to the numbered and bullet-pointed questions above.
