2012-11-17

Get the exit polls from CNN using R and Python

Yesterday I posted an example of plotting 2012 U.S. presidential exit poll results using ggplot2. There I took for granted that a data.frame containing all we need resides in a file called "PresExitPolls2012.Rdata". Today I want to show how I scraped the data from CNN.

The challenge

At first I tried to scrape the site using RCurl and the XML package. But the result was very disappointing. I just got empty data.frames while all browsers I used showed the data. Looking at the source code of the page, however, was equally disappointing:

Where I expected the percentage of, say, women voting for Romney, I saw a JavaScript variable name. Only looking at the generated source with Firebug revealed the data. The CNN pages are created dynamically by JavaScript that uses jQuery to load the data into variables. There was no way to get the data with RCurl alone.
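To make the problem concrete, here is a minimal Python sketch of what a static scraper sees. The HTML snippet, the placeholder name pollData.women.romney and the helper names are all made up for illustration; they are not CNN's actual markup. The point is only that a parser that never runs the page's scripts finds a variable name where the number should be.

```python
from html.parser import HTMLParser

# Hypothetical static source: the second cell holds a variable name that a
# script would replace with the real percentage in a browser.
STATIC_SOURCE = """
<table>
  <tr><td>Women</td><td>pollData.women.romney</td></tr>
</table>
<script>/* fills the cells after the page has loaded */</script>
"""


class CellCollector(HTMLParser):
    """Collect the text content of every <td> element."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Text outside of <td> (including script content) is ignored.
        if self.in_td:
            self.cells[-1] += data


def looks_numeric(text):
    """Return True if the text parses as a number, with or without '%'."""
    try:
        float(text.strip().rstrip("%"))
        return True
    except ValueError:
        return False


collector = CellCollector()
collector.feed(STATIC_SOURCE)
numeric_cells = [c for c in collector.cells if looks_numeric(c)]
print(collector.cells)   # the cells as a static scraper sees them
print(numeric_cells)     # the cells that actually contain numbers
```

Without a JavaScript engine the list of numeric cells stays empty, which matches the empty data.frames I got from RCurl.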

The solution

So I needed a real browser that could be controlled by a script. I decided to use a Python script to read the generated html from CNN. Here's the Python code that draws heavily on a thread I stumbled upon in a German forum:

Next I needed a function in R that puts together the URL for one of the CNN state sites, calls the Python script and returns a page tree of the generated HTML. getStateData() does the job:
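For illustration, the same steps can be sketched in Python. This is not the R function itself. The URL pattern, the script name render.py and the function names state_url and get_state_html are assumptions made for this sketch, and the sketch returns raw HTML rather than a parsed page tree.

```python
import subprocess

# Hypothetical URL pattern; substitute the real address scheme of the state pages.
URL_TEMPLATE = "http://example.com/election/2012/results/state/{state}/president"


def state_url(state, template=URL_TEMPLATE):
    """Build the URL of one state's results page from its two-letter code."""
    return template.format(state=state.upper())


def get_state_html(state, script="render.py", interpreter="python"):
    """Run the rendering script on the state's URL and return what it prints.

    `script` is assumed to take a URL as its only argument and to write the
    generated HTML to standard output.
    """
    result = subprocess.run(
        [interpreter, script, state_url(state)],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the script fails
    )
    return result.stdout.decode("utf-8")
```

A call such as get_state_html("OH") would then hand back the generated HTML as a string, ready to be parsed into a tree.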

The page tree getStateData returns contains a lot of noise, such as preliminary county results for some, but only some, of the counties. There are some "fake" exit polls designed to explain "how to read exit polls". And for every question asked, the results appear several times.

Filtering out the noise

To separate the wheat from the chaff, the grain from the husk, I split the job over two functions, parseEpNode and getExitPolls.

getExitPolls parses the tree using XPath, then calls parseEpNode for each of the nodes containing exit polls. (As an aside: this is an application of the "Split-Apply-Combine Strategy for Data Analysis" (pdf) described by Hadley Wickham when he introduced the plyr package. Ironically my getExitPolls doesn't use plyr::llply but the R standard lapply, though it makes use of plyr::rbind.fill...)
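The split-apply-combine pattern itself is easy to show in a few lines. The following Python sketch uses a made-up, well-formed snippet and invented attribute names (class, question, group, pct); the real CNN markup and my R functions look different. It uses the limited XPath support in xml.etree.ElementTree to select the poll nodes, parses each one separately, and flattens the results.

```python
import xml.etree.ElementTree as ET

# Made-up page: one noise node and two exit-poll nodes with placeholder values.
PAGE = """
<page>
  <div class="county">not an exit poll</div>
  <div class="exit-poll" question="Q1">
    <row group="A" pct="40"/>
    <row group="B" pct="60"/>
  </div>
  <div class="exit-poll" question="Q2">
    <row group="C" pct="100"/>
  </div>
</page>
"""


def parse_ep_node(node):
    """Apply step: turn one exit-poll node into a list of row dicts."""
    return [
        {
            "question": node.get("question"),
            "group": row.get("group"),
            "pct": float(row.get("pct")),
        }
        for row in node.findall("row")
    ]


def get_exit_polls(page_xml):
    root = ET.fromstring(page_xml.strip())
    # Split: select only the nodes that hold exit polls.
    nodes = root.findall(".//div[@class='exit-poll']")
    # Apply: parse each node on its own.
    per_node = [parse_ep_node(node) for node in nodes]
    # Combine: flatten the per-node results into one table-like list.
    return [row for rows in per_node for row in rows]


rows = get_exit_polls(PAGE)
for row in rows:
    print(row)
```

The list comprehension in the combine step plays the role that plyr::rbind.fill plays in the R version: it stacks the per-node results into one structure.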

parseEpNode is the real workhorse of the process. It filters out duplicate entries and demo polls. Again it relies on the Split-Apply-Combine Strategy without using l*ply. Sometimes lapply is easy enough, and Hadley himself uses it internally in some cases as well.
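The filtering idea can be sketched in Python as well. How a demo poll is recognized depends on the page, so the sketch takes that test as a parameter. The function name drop_noise, the dictionary layout and the "Sample" prefix are assumptions for illustration only.

```python
def drop_noise(polls, is_demo):
    """Keep the first copy of each poll and skip the ones flagged as demos."""
    seen = set()
    kept = []
    for poll in polls:
        # Two polls count as duplicates if question and answers are identical.
        key = (poll["question"], tuple(poll["answers"]))
        if is_demo(poll) or key in seen:
            continue
        seen.add(key)
        kept.append(poll)
    return kept


# Placeholder input: one repeated poll, one demo poll, one ordinary poll.
polls = [
    {"question": "Q1", "answers": ["A", "B"]},
    {"question": "Q1", "answers": ["A", "B"]},
    {"question": "Sample question", "answers": ["X"]},
    {"question": "Q2", "answers": ["C"]},
]

cleaned = drop_noise(
    polls, is_demo=lambda p: p["question"].startswith("Sample")
)
print([p["question"] for p in cleaned])
```

Keeping the order of first appearance makes the cleaned list easy to compare with what the browser shows.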

Putting it all together

This script puts it all together and produces the Rdata file whose existence I only assumed yesterday. It starts with a list of the 19 states + D.C. where no exit polls were conducted in 2012, taken from the Washington Post, and assembles the states of interest, again as a list over which getExitPolls can be lapply'd.

A probably much shorter post will add some improvements to the process. More later...

For PyQt4, please check the instructions at http://www.riverbankcomputing.com/static/Docs/PyQt4/html/

I use Python 2.6.4 and PyQt version 4.6; the code may have to be modified slightly for later versions.

Another hint: check the indentation, especially when you're using copy'n'paste. As you know, indentation is syntactically significant in Python, and it is sometimes garbled when passed through the clipboard...