in a word...Currents in Australian affairs, 2003–2013

About

What can you read in a single word? Drawing on details of ABC Radio current affairs programs stored in Trove, this page presents a word a month from AM, The World Today, and PM.

Which words? The selected words are those that seem most distinctive, based on a statistical measure called TF–IDF (Term Frequency – Inverse Document Frequency). It's a fairly simple method with numerous problems, but this isn’t intended as a rigorous statistical analysis.

The aim of this page is to provide a reminder of the people, events, stories and places that seemed significant at the time, but might now have faded from memory. It’s just one possible way of exploring the rich store of ABC data available through Trove. Hopefully it will inspire others to dig deeper.


Data

Trove, the National Library of Australia's discovery service, harvests the details of 54 ABC Radio National programs. Currently there are over 200,000 RN records in Trove, including details of every segment of ABC Radio's current affairs programs – AM, PM and The World Today – broadcast since 1999. This is an important record of Australia's recent social and political history.

All of this metadata is available for re-use through the Trove API, opening it up for new forms of analysis and exploration.

Building a harvester to grab data from Trove is pretty straightforward. A while back I created a simple Python TroveHarvester class that handles all the basics. You just need to subclass this, replacing the process_results() method with something more useful.
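To make the subclassing idea concrete, here is a minimal sketch. The base class below is a stand-in for the real TroveHarvester – it only assumes the behaviour described above (the base class pages through results and hands each batch to process_results()); the actual class's interface may differ.

```python
class TroveHarvester:
    """Stand-in base class: pages through API results and delegates
    each batch to process_results(). Not the real TroveHarvester."""
    def __init__(self, pages):
        self.pages = pages  # pretend these batches came from the Trove API

    def harvest(self):
        for results in self.pages:
            self.process_results(results)

    def process_results(self, results):
        raise NotImplementedError  # subclasses decide what to do with a batch


class RadioNationalHarvester(TroveHarvester):
    """Subclass that just collects work records in a list; a real version
    might save each record to MongoDB instead."""
    def __init__(self, pages):
        super().__init__(pages)
        self.records = []

    def process_results(self, results):
        # Each batch is assumed to be a list of work dicts.
        self.records.extend(results)


# Usage with two fake pages of results
harvester = RadioNationalHarvester([[{'id': '1'}, {'id': '2'}], [{'id': '3'}]])
harvester.harvest()
print(len(harvester.records))  # 3
```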

The harvester I used to build a local copy of all the Radio National data is available on GitHub. It simply queries the Trove API and saves the results to a MongoDB collection.

API results are returned at the ‘work’ level, and a single work may include multiple versions. In the case of the Radio National data, the records of some regular segments have been grouped together as works because they have the same title and creator. To make it easy to get at all the individual records, the harvester opens up each work and extracts and saves the metadata for all the versions inside. As a result, the total number of records saved by the harvester will be greater than the number of work-level search results returned by Trove.
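The flattening step might look something like this. The field names ('version', 'id') are illustrative assumptions about the shape of the API response, not a definitive schema.

```python
def extract_versions(work):
    """Return one metadata record per version inside a work,
    keeping a link back to the parent work."""
    records = []
    for version in work.get('version', []):
        record = dict(version)
        record['work_id'] = work['id']
        records.append(record)
    return records


# A work that groups two segments of a regular program under one title
work = {
    'id': '123',
    'title': 'Finance report',
    'version': [
        {'id': 'v1', 'date': '2013-10-01'},
        {'id': 'v2', 'date': '2013-10-08'},
    ],
}

records = extract_versions(work)
print(len(records))  # 2 – more records than the single work-level result
```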

Method

TF–IDF (Term Frequency – Inverse Document Frequency) provides a measure of how significant a word is in a particular document by comparing its frequency within that document to its frequency within a collection of similar documents. Words that are common across all documents will have a low TF–IDF value, while words that appear frequently in just a small subset of documents will have a high value.
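Here's the calculation in miniature, without any libraries. The exact formula varies between implementations; this is the common tf × log(N/df) version.

```python
import math

def tf_idf(term, doc, corpus):
    """doc is a list of words; corpus is a list of such documents."""
    tf = doc.count(term) / len(doc)  # relative frequency in this document
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)  # penalise widespread terms
    return tf * idf

corpus = [
    ['drought', 'policy', 'drought'],  # 'drought' concentrated here
    ['policy', 'election'],
    ['policy', 'budget'],
]

# 'policy' appears in every document, so it scores zero;
# 'drought' is frequent in one document only, so it scores highly.
print(tf_idf('policy', corpus[0], corpus))   # 0.0
print(tf_idf('drought', corpus[0], corpus))  # ~0.73
```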

I’d previously played around with TF–IDF during my Harold White Fellowship at the National Library of Australia. Using a set of 10,000 newspaper articles harvested from Trove, I created The Future of the Past, which used words with high TF–IDF values as a way of navigating through time (and creating your own tweetable fridge poetry!).

TF–IDF is used all over the place by search engines to calculate the similarity of documents, but what entranced me was the evocative power of the words themselves. They seemed to tell a story...

For In A Word I wanted to try to capture the ebb and flow of current affairs – in particular, those topics that held our attention for days or weeks and then disappeared. TF–IDF seemed well suited to identifying what was different over time.

In this case a ‘document’ is actually the combined titles and summaries of every segment of a program for a month. The TF–IDF values are calculated by comparing this ‘document’ to every other month from the complete decade, 2003–2013.

I used the Python Natural Language Toolkit (NLTK) to actually perform the calculations. So the basic method for each program went something like:

Loop through program records by year/month, writing titles and summaries for each month to a separate text file

Create a corpus of all the text files using NLTK

Loop through the corpus file by file, breaking the content up into individual words (known as tokenising)

Loop through the words calculating a TF–IDF for each (ignoring common 'stopwords', just to speed things up a bit)

Select the word with the highest TF–IDF for each month and write them to a data file to use in the web interface
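The steps above can be sketched end-to-end. I used NLTK's corpus tools for the real thing; this version uses plain Python so it stands alone, and the stopword list and sample data are illustrative only.

```python
import math
import re

# A deliberately tiny stopword list; the real run used a fuller one.
STOPWORDS = {'the', 'a', 'of', 'in', 'on', 'to', 'and'}

def tokenise(text):
    """Break text into lowercase words."""
    return re.findall(r'[a-z]+', text.lower())

def top_word(months):
    """months maps 'YYYY-MM' to the combined titles and summaries for
    that month. Returns the highest-scoring TF-IDF word per month."""
    docs = {month: tokenise(text) for month, text in months.items()}
    n_docs = len(docs)
    results = {}
    for month, words in docs.items():
        best_word, best_score = None, 0.0
        for word in set(words) - STOPWORDS:  # skip stopwords
            tf = words.count(word) / len(words)
            df = sum(1 for d in docs.values() if word in d)
            score = tf * math.log(n_docs / df)
            if score > best_score:
                best_word, best_score = word, score
        results[month] = best_word
    return results

months = {
    '2013-10': 'shutdown talks in washington the shutdown continues',
    '2013-11': 'typhoon relief efforts the typhoon aftermath',
    '2013-12': 'documentary special a documentary on the year in politics',
}
print(top_word(months))
# {'2013-10': 'shutdown', '2013-11': 'typhoon', '2013-12': 'documentary'}
```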

Problems

You might be wondering about the word ‘Documentary’ that turns up in the PM column a few times in December and January. What’s going on? In the holiday period, many of the stories broadcast on PM included the words ‘Documentary Special’ in the title. So the word ‘Documentary’ is really common in December and January, but pretty rare throughout the rest of the year. As a result it earns a high TF–IDF score.

Similarly, you might be straining to recall some of the names that turn up in PM – Alle, Glanville and Woolrich for example. Were they politicians, criminals, corporate big-wigs...? They’re actually business and finance reporters. You’d generally expect that the names of journalists would be common enough across the whole corpus that their TF–IDF values would be fairly low. But PM seems to have turned over business and finance reporters on a regular basis, so they're very prominent for relatively short periods. And that means they score well.

So why did I leave these oddities in? I thought about simply adding words like ‘Documentary’ to my list of stopwords so they’d be ignored. But what if there really was a case where a documentary had become controversial? Similarly, I looked at ways of extracting the names of journalists from the program metadata so I could exclude them from the calculations. But it was hard to separate journalists from the subjects of their stories.

In the end I thought that instead of massaging the data it was better to leave the problems in, and make the limitations of this method and the complications of the data obvious. Perhaps I’ll create a cleaned up version in the future for comparison.