Behind the Scenes at the Guardian Datablog

Figure 17. The Guardian Datablog production process visualized (The Guardian)

When we launched the Datablog, we had no idea who would be interested in raw data, statistics and visualizations. As someone pretty senior in my office said: “Why would anyone want that?”

The Guardian Datablog, which I edit, was to be a small blog offering the full datasets behind our news stories. Now it consists of a front page (guardian.co.uk/data); searches of world government and global development data; data visualizations by Guardian graphic artists and from around the web; and tools for exploring public spending data. Every day, we use Google spreadsheets to share the full data behind our work; we visualize and analyze that data, then use it to provide stories for the newspaper and the site.

As a news editor and journalist working with graphics, it was a logical extension of work I was already doing, accumulating new datasets and wrangling with them to try to make sense of the news stories of the day.

The question I was asked has been answered for us. It has been an incredible few years for public data. Obama opened up the US government’s data vaults as his first legislative act, followed by government data sites around the world: Australia, New Zealand, the British government’s Data.gov.uk.

We’ve had the MPs’ expenses scandal, Britain’s most unexpected piece of data journalism. The resulting fallout has meant Westminster is now committed to releasing huge amounts of data every year.

We had a general election where each of the main political parties was committed to data transparency, opening our own data vaults to the world. We’ve had newspapers devoting valuable column inches to the release of the Treasury’s COINS database.

At the same time, as the web pumps out more and more data, readers from around the world are more interested in the raw facts behind the news than ever before. When we launched the Datablog, we thought the audiences would be developers building applications. In fact, it’s people wanting to know more about carbon emissions or Eastern European immigration or the breakdown of deaths in Afghanistan — or even the number of times the Beatles used the word “love” in their songs (613).

Gradually, the Datablog’s work has reflected and added to the stories we faced. We crowdsourced 458,000 documents relating to MPs' expenses and we analyzed the detailed data of which MPs had claimed what. We helped our users explore detailed Treasury spending databases and published the data behind the news.

But the game-changer for data journalism happened in spring 2010, beginning with one spreadsheet: 92,201 rows of data, each one containing a detailed breakdown of a military event in Afghanistan. This was the WikiLeaks war logs. Part one, that is. There were to be two more episodes to follow: Iraq and the cables. The official term for the first two parts was SIGACTS: the US military Significant Actions Database.

News organizations are all about geography, and proximity to the news desk. If you’re close, it’s easy to suggest stories and become part of the process; conversely, out of sight is literally out of mind. Before WikiLeaks, we sat on a different floor, with graphics. Since WikiLeaks, we have sat on the same floor, next to the news desk. It means that it’s easier for us to suggest ideas to the desk, and for reporters across the newsroom to think of us to help with stories.

It’s not that long ago journalists were the gatekeepers to official data. We would write stories about the numbers and release them to a grateful public, who were not interested in the raw statistics. The idea of us allowing our raw information into our newspapers was anathema.

Now that dynamic has changed beyond recognition. Our role is becoming that of interpreters, helping people understand the data, and even just publishing it because it’s interesting in itself.

But numbers without analysis are just numbers, which is where we fit in. When Britain’s prime minister claimed the riots in August 2011 were not about poverty, we were able to map the addresses of the rioters with poverty indicators to show the truth behind the claim.

Behind all our data journalism stories is a process. It’s changing all the time as we use new tools and techniques. Some people say the answer is to become a sort of super hacker, write code and immerse yourself in SQL. You can decide to take that approach. But a lot of the work we do is just in Excel.

Firstly, we locate the data or receive it from a variety of sources, from breaking news stories, government data, journalists' research and so on. We then start looking at what we can do with the data — do we need to mash it up with another dataset? How can we show changes over time? Those spreadsheets often have to be seriously tidied up — all those extraneous columns and weirdly merged cells really don’t help. And that’s assuming it’s not a PDF, the worst format for data known to humankind.
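For readers who do want to try the code route, the tidying step can be sketched in a few lines of Python. This is only an illustration, not how the Datablog does it (most of our tidying happens in a spreadsheet). The sample data, the column names and the `tidy` helper are all made up for the example; it assumes the spreadsheet has been exported as CSV, that extraneous columns show up with empty headers, and that merged cells come through as blanks below the first value.

```python
import csv
import io

# Made-up CSV export of a messy spreadsheet (placeholder values only).
# Empty-header columns are extraneous; the blank "Region" cell on the
# third line is what a vertically merged cell often looks like in CSV.
raw = """Region,,School,Cost,
North,,School A,100,
,,School B,200,
South,,School C,300,
"""

def tidy(text, fill_columns=("Region",)):
    """Drop columns with empty headers and fill blanks left by merged cells."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]

    # Keep only the columns that actually have a header.
    keep = [i for i, name in enumerate(header) if name.strip()]
    names = [header[i].strip() for i in keep]

    cleaned = []
    last_seen = {}  # most recent non-blank value per fill column
    for row in body:
        if not row:
            continue  # skip completely empty lines
        record = {}
        for i, name in zip(keep, names):
            value = row[i].strip() if i < len(row) else ""
            if name in fill_columns:
                if value:
                    last_seen[name] = value
                else:
                    # Blank cell under a merged cell: carry the value down.
                    value = last_seen.get(name, "")
            record[name] = value
        cleaned.append(record)
    return cleaned

result = tidy(raw)
for record in result:
    print(record)
```

Which columns to fill down is a judgment call that depends on the spreadsheet, which is why it is left as a parameter here rather than guessed automatically.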

Often official data comes with the official codes added in; each school, hospital, constituency and local authority has a unique identifier code.

Countries have them too (the UK’s code is GB, for instance). They’re useful because you may want to start mashing datasets together and it’s amazing how many different spellings and word arrangements can get in the way of that. There’s Burma and Myanmar, for instance, or Fayette County in the US — there are 11 in states from Georgia to West Virginia. Codes allow us to compare like with like.
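A small Python sketch shows why codes matter when mashing datasets together. Everything in it is hypothetical: the two datasets, the field names and the `join_on` helper are invented for illustration, and the numbers are placeholders rather than real statistics. The only real-world details are the two-letter country codes, GB for the UK and MM for Burma/Myanmar.

```python
# Two small, made-up datasets describing the same two countries.
# The values are placeholders for illustration, not real statistics.
dataset_a = [
    {"name": "Burma", "code": "MM", "value_a": 10},
    {"name": "United Kingdom", "code": "GB", "value_a": 20},
]
dataset_b = [
    {"name": "Myanmar", "code": "MM", "value_b": 1.5},
    {"name": "UK", "code": "GB", "value_b": 2.5},
]

def join_on(key, left, right):
    """Inner-join two lists of dicts on the given key."""
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            # Combine both rows; values from the left row win on shared keys.
            joined.append({**match, **row})
    return joined

by_name = join_on("name", dataset_a, dataset_b)
by_code = join_on("code", dataset_a, dataset_b)

print(len(by_name))  # no rows match: the spellings differ
print(len(by_code))  # both rows match: the codes agree
```

Joining on the name silently loses every row, while joining on the code keeps them all. The same idea applies to school, hospital or local authority identifiers.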

At the end of that process is the output: will it be a story or a graphic or a visualization, and what tools will we use? Our top tools are the free ones that we can produce something quickly with. The more sophisticated graphics are produced by our dev team.

That means we commonly use Google charts for small line graphs and pies, or Google Fusion Tables to create maps quickly and easily.

It may seem new, but really it’s not.

In the very first issue of the Manchester Guardian, Saturday 5 May, 1821, the news was on the back page, like all papers of the day. First item on the front page was an ad for a missing Labrador.

And, amid the stories and poetry excerpts, a third of that back page is taken up with, well, facts. A comprehensive table of the costs of schools in the area never before “laid before the public”, writes “NH”.

NH wanted his data published because otherwise the facts would be left to untrained clergymen to report. His motivation was that: “Such information as it contains is valuable; because, without knowing the extent to which education … prevails, the best opinions which can be formed of the condition and future progress of society must be necessarily incorrect.” In other words, if the people don’t know what’s going on, how can society get any better?

I can’t think of a better rationale now for what we’re trying to do. What was once a back page story can now make front page news.