This is a guest post by Martha Rotter, co-founder of Woop.ie and of the recently launched Irish technology magazine Idea.

Hey, remember the Wikipedia blackout? I do, because I was highly amused by the number of students panicking over papers or homework they seemingly could not complete without this one website.

One of my favourite things to do with ScraperWiki is to capture people’s reactions and sentiments, and then try to make predictions based on the data. I call it a “Zeitgeist Parse”, because I’m looking for the general public’s response to some event currently happening. Looking at the barrage of tweets coming from confused and frustrated students, I wondered whether we could predict an upcoming epidemic of bad grades or test results.

PROCESS

I built a few quick scrapers to grab tweets related to the Wikipedia blackout. The queries I used were “wikipedia AND paper” and “wikipedia AND homework”. I thought people with homework might be worried about slightly different things than people with more detailed term papers or reports. You can see the Python code for them on my ScraperWiki profile.

After the results were stored, I wanted to do something very simple. I wanted to parse all of the records and get the words tweeted most frequently. From there, I could start to analyze the data more clearly and find patterns and trends.

One way to do this is to take the data & use something like IBM’s ManyEyes to get a visualization of frequently used text. This is handy if you want a Tag Cloud or basic chart to view the results.

However, I was conscious of the fact that with so many tweets, it could be easy to miss smaller but still significant trends. A really easy way to parse and sort text is by using Excel + VBA. Since ScraperWiki can export to CSV, I downloaded the CSV files & wrote a small macro to walk through the words and count instances of each of them. After sorting the results, I had a fairly solid picture of the top words used by protesting tweeters.
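For anyone who would rather stay in Python than open Excel, the same word count could be sketched roughly as below. This is not the macro I used, just a minimal equivalent under a few assumptions: the column name "text" and the two sample rows are made up for illustration, and words are split on whitespace only.

```python
import csv
import io
from collections import Counter

def count_words(csv_file, text_column="text"):
    """Count how often each whitespace-separated word appears in one CSV column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Capitalization is kept on purpose, so "NEED" and "need" count separately.
        counts.update(row[text_column].split())
    return counts

# Tiny made-up sample standing in for an exported CSV file.
# The "text" column name is an assumption; check the header of your own export.
sample = io.StringIO(
    "id,text\n"
    "1,I NEED wikipedia for my paper\n"
    "2,wikipedia is DOWN and I NEED it\n"
)

# most_common(n) returns the n most frequent words with their counts.
top_words = count_words(sample).most_common(3)
```

With a real export you would pass an open file instead of the sample, for example open("tweets.csv", newline="", encoding="utf-8"), where the filename is again just a placeholder.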

WHAT I DIDN’T FIND

I actually did not find specific subjects. Hardly any comments about which course or paper was in danger due to the shutdown. Few worries about particular subjects, the notable exceptions being history, with 50 instances, and English, with 37 instances appearing in the data. For a moment, it looked like my experiment was basically a waste of time and processing power.

WHAT I DID FIND

But as I examined the results, what I actually found was slightly more interesting. After removing obvious words like Wikipedia and homework, I started to see a few recurring patterns in terms of type of language used.

The panic of the situation jumps out immediately. Words like GOTTA (I didn’t remove capitalization as it adds context in this scenario), fail, DOWN, NEED, TOMORROW, extension, justmyluck, screwed, and even HLP!!!!!!! appear in high numbers throughout the results.

Next I noticed the very emotional nature of the language. As expected, lots of swearing and foul language appear. But high instances of things like hate, mad, suck, freaking, omfg, fixitnow, and of course WTF also showed up in the data. As someone who in college definitely did my share of writing papers the night before they were due, I understand the terror and panic. On the other hand, I was usually surrounded by library books I had checked out (probably that day) with no fear that they might suddenly go blank.

The last pattern that I noticed was one of interesting hashtags. These included expected ones like #blackout, #badtiming, #PIPA, #stopSOPA, #wikipediablackout, and #sopa. But there were also some really bizarre ones whose connection to the situation I can’t work out, and which may simply remain a mystery: #fratproblems, #thekidsareourfuture, #BingGrlProblems, #SHOUTOUT, and #cooooooooooooooooooooooooooooooool. But my favourite one was probably #GoToALibrary!
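The filtering and grouping described in this section could be sketched in Python along these lines. Treat it as an illustration rather than what I actually ran: the list of "obvious" words, the function name, and the sample counts are all made up for the example.

```python
from collections import Counter

# Hypothetical list of "obvious" words to drop; compared case-insensitively.
OBVIOUS_WORDS = {"wikipedia", "homework", "paper"}

def split_hashtags(counts, obvious=OBVIOUS_WORDS):
    """Return (words, hashtags) Counters with the obvious words removed."""
    words, hashtags = Counter(), Counter()
    for token, n in counts.items():
        if token.lower() in obvious:
            continue  # skip search terms that would dominate the results
        if token.startswith("#") and len(token) > 1:
            hashtags[token] += n
        else:
            # Original capitalization is kept, since it adds context here.
            words[token] += n
    return words, hashtags

# Invented counts, only to show the shape of the input and output.
sample_counts = Counter(
    {"Wikipedia": 40, "NEED": 12, "#sopa": 9, "fail": 7, "#GoToALibrary": 1}
)
words, hashtags = split_hashtags(sample_counts)
```

Keeping hashtags in their own counter makes the odd one-off tags easy to spot, since they are no longer buried under the ordinary words.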

SO YOU WANNA CREATE A ZEITGEIST PARSE?

Start by identifying your query parameters. Are you searching by words, by geography, by date? Remember that Twitter’s Search API only goes back a few days, so if you’re looking for anything older than a week this API won’t work. Twitter’s API documentation is great but does change every so often, so keep an eye on Twitter Developers for the most up-to-date information about their API and what you can use as parameters for the Search API.

Once you have defined your query, the next step is to create your ScraperWiki scraper with the information. Feel free to copy the source from one of my scrapers like this one and update it with your own parameters.

Next you’ll need to set up the scraper to run one or more times. How often do you want it to run to get useful results? You can run it once & download the data as a JSON or CSV file, or as a SQLite database. Or you can schedule it to run at regular intervals and download the info yourself each time.

After you have the data you need, all you have left to do is analyse. I mentioned ManyEyes earlier, which you can use to get some nice visualizations quite easily or you can use Excel or Google Refine to parse and examine the data. If you’re comfortable with JavaScript, something like HighCharts can help to create nice, interactive visualizations easily from your data.

And now you have a good overview of what people think about the given topic in either a dataset or visualization. Hopefully you made some predictions about what you would find so you can validate your predictions or, as in my case here, observe something completely different.

SUMMARY

Writing a quick Python method using ScraperWiki to query Twitter’s search API is fairly straightforward. Finishing a term paper without using Wikipedia on the other hand? Not so straightforward for some unfortunate students!

2 Responses to Parsing panic

[quote]A really easy way to parse and sort text is by using Excel + VBA. Since ScraperWiki can export to CSV, I downloaded the CSV files & wrote a small macro to walk through the words and count instances of each of them.[/quote]
Hmm I guess people are still more comfortable performing the manual steps necessary to get to familiar Excel-land. You could do it all on scraperwiki by writing the data to a database table and using a one-line query:
“select distinct word, count(word) order by count(word)”
to get your results
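
As written, the query above has no FROM clause and no GROUP BY, so it would not run as is. A minimal sketch of the same idea using Python's built-in sqlite3 module follows. The table name "words", its single "word" column, and the sample rows are assumptions made for illustration, not ScraperWiki's actual schema.

```python
import sqlite3

# In-memory database standing in for the scraper's data store.
conn = sqlite3.connect(":memory:")

# Hypothetical table: one row per word found in the tweets.
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [("NEED",), ("fail",), ("NEED",), ("DOWN",), ("NEED",), ("fail",)],
)

# GROUP BY produces one row per distinct word, so DISTINCT is not needed.
# Ties are broken alphabetically to keep the ordering predictable.
rows = conn.execute(
    "SELECT word, COUNT(word) AS n FROM words "
    "GROUP BY word ORDER BY n DESC, word"
).fetchall()
conn.close()
```

Each entry in rows is a (word, count) tuple, most frequent first.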