Data!
Here's the data for HW6. This is every article indexed in MEDLINE as being of the "Clinical Trial" publication type as of early 2017. For the rest of the assignment, you will be working with this data set, so your first step is to download it.
I have split the data up into chunks, with each chunk being a file containing 25,000 records. Each record is on its own line and is a complete JSON document. You'll need to download every chunk; you may then choose to reassemble them into a single large file, or work with them as a collection of smaller files.
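Since each line of a chunk is a standalone JSON document, you can parse a chunk by reading it line by line. A minimal sketch, using a hypothetical two-record sample in place of a real downloaded chunk (the field names here are made up for illustration):

```python
import json
from io import StringIO

# Stand-in for an open chunk file: one complete JSON record per line.
sample_chunk = StringIO(
    '{"pmid": "12345", "title": "A trial of X"}\n'
    '{"pmid": "67890", "title": "A trial of Y"}\n'
)

# Parse each non-blank line as its own JSON document.
records = [json.loads(line) for line in sample_chunk if line.strip()]
print(len(records))        # 2
print(records[0]["pmid"])  # 12345
```

With a real chunk you would use `open("chunk-file")` in place of the `StringIO` sample; the same loop works unchanged.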
Downloading this many files by hand gets to be a pain, so you should use the HTML "scraping" techniques that we talked about in class. You will need to install the following Python packages: requests, lxml, and cssselect. All are available via pip or conda.
A general outline to follow with this sort of task:
- Identify the place in the page that has the information you need. In this case, it's the href attribute of the links in the list below.
- View the page source to figure out what the HTML around those elements looks like.
- Are they distinctively labeled somehow, perhaps with id or class attributes?
- If not, work your way up the page's tree structure through their parent elements; perhaps one of those is distinctively labeled.
- Hint: In the case of this page, you'll note that the list containing the links to the files has an id attribute.
- Now, figure out a distinctive selector for the element(s) you want. See here for a good tutorial on how to do this if you don't remember from class.
- Use lxml's cssselect() method to grab the elements you need, and read their href attributes.
This will get you a list of URLs to the data files; to actually download them to your computer, you have two choices:
- Use requests to download the files directly from your Python script, and write each one to disk.
- Have your Python script write the URLs themselves to a file, and then use the wget command to download the files. Hint: wget has a command-line option, -i, that lets you specify a file full of URLs to download.
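The first option (downloading directly with requests) might look like the sketch below. The `dest_dir` name and the helper functions are my own; only the streaming-download pattern is the point:

```python
import os
import requests

def local_name(url, dest_dir="data"):
    """Map a file URL to a local path, named after its last path component."""
    return os.path.join(dest_dir, url.rsplit("/", 1)[-1])

def download(url, dest_dir="data"):
    """Stream one file to disk rather than holding it all in memory."""
    os.makedirs(dest_dir, exist_ok=True)
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    path = local_name(url, dest_dir)
    with open(path, "wb") as fh:
        for block in resp.iter_content(chunk_size=1 << 16):
            fh.write(block)
    return path
```

Looping `download(u)` over your scraped URL list completes the job; for the second option, writing `"\n".join(urls)` to a text file and running `wget -i that_file` does the same thing from the shell.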
Once you've downloaded the files, proceed with the remainder of the assignment as described in Sakai.
Files