
Thursday, March 01, 2012

“Facts are Sacred”, The Guardian Datablog says, and the Guardian challenges us to examine those facts carefully, as its recent story on the "3 Little Pigs" illustrates.

While creative visualizations undoubtedly make indicator data more interesting, without an informed explanation of what the information means, the visualizations can often be misleading. The Guardian Data Store and the Guardian Datablog provide an example of how journalists bring images, indicator data and intelligent analysis together.

Limitations: The raw data searches work only intermittently, and the Kindle format reduces the utility of the Ebook.

Who this is for

Not everyone will be able to produce the kind of data visualizations presented by the Guardian, but we can all benefit from the examples the news site provides of where to get data, how to present it, and what kinds of questions are necessary in determining the reliability and validity of indicator data.

Those who may have the resources to actually produce these kinds of visualizations will probably not be working at a project level. But aid agency communications staff, and some large agencies at the country level, may have the intellectual and financial resources to do what the Guardian data team does.

Background - data visualization

Many development workers, project managers, monitors and evaluators, wading through dense international development project reports and evaluations, may wish that more attention were paid to making indicator data more understandable.

The late Hans Rosling, for example, is well known for creative presentations on indicators.

These are intended, as the Gapminder Institute puts it, to “unveil the beauty of statistical time series by converting boring numbers into enjoyable, animated and interactive graphics”. But doing this at the project and programme level in international development can be challenging.

And as entertaining as animations, graphs or pictures are, by themselves they can be misleading. Hans Rosling does more than provide the graphics, of course. He interprets what the data and the graphics can tell us, in clear, compelling language.

The Gapminder Institute is not alone, however, in presenting – and interpreting - indicator data in a compelling manner.

The most interesting part of the Guardian site (aside from the 3 Little Pigs ad), for many of us working in international development, is the Guardian Datablog and the associated Data Store, its directory of all of the statistics the Guardian uses as it reports the news. This includes a World Government data search, a Development data search, examples of featured data visualizations, and a link to an electronic version of Facts are Sacred, a new book by Simon Rogers, one of the Guardian’s news editors, on how the Guardian collects and presents data. Rogers is also editor of both the Data Store and the Datablog.

The Guardian itself, in its main economic, political, health and education pages, publishes mainline news stories. What the Datablog does, as far as I can see, is highlight the stories making innovative use of publicly available data, explain where the data come from, and then challenge readers to question the data, or do more with the information. In some cases the Guardian Datablog appears to produce its own visualizations from raw data sources, but in most, it seems, the Datablog team provide a link to, or a variation on, another agency’s visualization – and then they provide the Guardian’s explanation of what the information means.

Scope of the Guardian’s data

The Manchester Guardian, one of the UK’s oldest newspapers, was founded in 1821 and established its online site almost 175 years later, in 1995. The Guardian Datablog was established in 2009 to explore what editor Simon Rogers refers to as data journalism – journalism which mines available data looking for hidden or emerging stories.

As he points out, data journalism is not new – good journalists have been using obscure data as the basis for breaking news for centuries.

Between January 15, 2009, and July 15, 2011 – when it produced a summary of all of the blog’s work on data journalism – the Datablog listed 1,407 articles or blog posts, including links to all of the available underlying data in spreadsheet format. It is unclear why that information has not been updated since it was posted in 2011, but by my very rough estimate, as I write this in March 2012, there have been approximately 490-500 additional posts between August 2011 and February 28, 2012. This brings the total to approximately 1,900 articles – all with some form of graphic (tables or charts, static or interactive, simple or complex), and all challenging us to use the underlying data ourselves in whatever way makes sense to us.

This is a mind-boggling number of analyses, given the detail involved – over 10 such articles a week for two and a half years – and it does not necessarily represent the true extent of the work the Guardian has done.

As just one example, a story about malaria by the Guardian data editor, referenced in the Datablog on February 3, 2012, was preceded and followed within a week by at least three other stories on malaria: in the main section of the Guardian (by the health editor), in the Guardian Weekly, and in the New Review section of the Guardian’s sister publication, the Observer.

Old data - new presentations

An interesting illustration of both the history of data journalism and of how the Guardian gets – and uses – data can be found in a Datablog post of September 26, 2011. The first edition of the Guardian, in 1821, carried a story on school funding and attendance based on leaked data, and in 2011 a Datablog post repackaged the same data using today’s technology.

The original story – as we would expect within the constraints of the day’s technology – used print and tables. But the 2011 review presents the same data using the free IBM ManyEyes software, in a form which allows users to manipulate the data, sort it, and see the results in different formats.

And going further, the Datablog provides the data in spreadsheet form, so readers can download it, and analyse it using any programme they think will produce interesting variations on both presentation and insight.
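
As a minimal sketch of the kind of re-analysis the Datablog invites – in Python, with a small invented table standing in for a downloaded Guardian spreadsheet (the column names and figures here are illustrative, not the Guardian’s actual data):

```python
import csv
from io import StringIO

# Invented stand-in for a spreadsheet downloaded from the Datablog;
# the columns and numbers are illustrative only.
raw = StringIO("""country,enrolment_2005,enrolment_2010
Kenya,76.0,82.5
Ghana,65.0,76.1
Nepal,79.0,86.8
""")

# Compute the change in the indicator between the two years -
# the sort of simple secondary analysis any reader could do.
for row in csv.DictReader(raw):
    change = float(row["enrolment_2010"]) - float(row["enrolment_2005"])
    print(f"{row['country']}: {change:+.1f} percentage points")
```

Anything beyond this – cross-referencing several data sets, or building an interactive chart – takes correspondingly more time and skill, which is part of the point the review makes below about resources.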

Indicator data in context

But visualizing the data is one thing and explaining it is another. I have seen project reports with fine graphics, but confusing narratives, and incomprehensible results.

Putting data in context is something the Guardian does well in most of its reporting, using, for the most part, plain, clear language to explain what the charts and graphs mean. In this particular post on education in 1821, the importance of the story, and the context, are explained in an accompanying analysis to the 2011 Datablog post.

It is worth noting that the url – the web address – for the Datablog is a subset of the news url: "http://www.guardian.co.uk/news/datablog", and its editor is a news editor of the Guardian.

“Data journalism is not graphics and visualisations. It's about telling the story in the best way possible. Sometimes that will be a visualisation or a map (see the work of David McCandless or Jonathan Stray).

But sometimes it's a news story. Sometimes, just publishing the number is enough.
If data journalism is about anything, it's the flexibility to search for new ways of storytelling. And more and more reporters are realising that. Suddenly, we have company - and competition. So being a data journalist is no longer unusual.

It's just journalism."

Finding the indicator visualizations relevant to development

But with almost 2,000 Datablog examples of data visualization – and probably several thousand accompanying news stories – how do you find those related to development? Most of the stories and visualizations are in fact UK-oriented. But there are dozens with direct relevance to development workers, project managers and aid agencies.

Unfortunately, for a site so focused on data accessibility, narrowing down the huge amount of material to specific development topics with data visualizations is not something that is always intuitively easy on the Guardian site.

Readers going to the Data Store site, which provides links to all of the data resources, will see, at the top, a list of the 5 or 6 most recent posts from the Datablog, and a link to older posts. Those older posts are listed at roughly 15 per page, going back apparently as far as 2009, which would mean clicking manually (or digitally) through 125 or so pages.

Given the nature of what is there, I can recommend it as an entertaining manner of passing time, but for readers looking for something specific, it is far too time consuming to be workable.

1. Searching for development news

Searching for news stories relevant to development issues is relatively easy: From the main Guardian page clicking on "Development" takes the reader to the Global Development section.

This has a dedicated search box on the right, at the top.

Copyright – Guardian News & Media Ltd.

Hundreds of news stories can be obtained on development topics through this search. Some of these include maps, charts or interactive visualizations of underlying data, although most do not.

2. Raw indicator data

Raw indicator data - with no accompanying interpretation or visualization - is also available, from time to time, for those who want to sort and interpret the data themselves. There are search boxes for both World Government Data and for Global Development Data on the site. Both were difficult to use at the beginning of 2012 and while you may get lucky and get data now, as I did earlier today, it is not consistent.

It is rare to find news editors, or their teams, on an international site such as this who actually reply to readers’ comments, let alone to their emails. But a major strength of the Guardian is the responsiveness of the data team. After I pointed out my problems with search to editor Simon Rogers, I found, within hours, that the Guardian data searches did start to work, although that success proved short-lived. A few hours later, the search boxes were not functional.

13,000 sets of raw indicator data - when the "search" works

When it works, and this seems intermittent, the World Government Data search provides links to raw data, usually in spreadsheets but occasionally in other formats, data which users can download and manipulate for themselves. There are roughly 13,000 such data sets sometimes available through this Guardian portal, from US, UK, Australian, Canadian, New Zealand and Spanish governments, on at least a dozen topics such as agriculture, education, environment, health, or population.

The Global Development Data search, which also works only intermittently, is slightly more difficult to find on the Data Store page, but when it does work, it provides links to similar types of data – over 5,000 sets in all, primarily from DFID, the World Bank and the UN Office for the Coordination of Humanitarian Affairs. These are, again, usually provided in spreadsheet or XML format, which those with the technical skills can use, with free software available online, to make new visualizations.

MDG data

What I found particularly interesting about the way this is organized is that, when the data search is functional, we can sort the indicator data by agency – but also, 151 of the data sheets are sorted by their relevance to the different Millennium Development Goals, and 80 in terms of their relevance to Millennium Development Targets.

There are, so far, no links that I could find to other donor agencies such as AusAid, CIDA, or USAID, but these may be added as those agencies, some of which do make data available on their own sites, see the utility of the Guardian links.

It is genuinely frustrating, however, to find such a potentially useful tool that is also this unreliable.

3. Indicator data visualizations

Getting to the lighter and more entertaining part of the data search – the development indicator data visualizations – is more time consuming, but also more reliable, than the raw data search. There is no direct search box for such visualizations, but there are at least 4 ways to find the stories and the data visualizations that may be of interest to readers:

A Graphics Link - For readers who don’t want to look at spreadsheets but just want the stories and the more interesting graphics, 694 of these can be found at the graphics link.

A Directory - Readers can go to the bottom of the Data Store page, where there is a form of directory: a list of 8 categories covering 49 sub-topics – and, apparently, all 1,900 posts since 2009.

Copyright – Guardian News & Media Ltd.

Under “World”, for example, there are 10 sub-categories, one of which is “Development Data”. Clicking on that takes us to 6 pages with roughly 80 stories, including dozens of articles and graphics posted just since November 2011.

There are hundreds more available also, from the period prior to November 2011.

An Alphabetical list - Readers can go to the A-Z data search at the top of the Data Store

Copyright – Guardian News & Media Ltd.

This link gives us a list of roughly 180-200 topics. Some development-related topics are listed here, alphabetically, but others are not.

“Literacy”, for example, is listed, providing 5 stories, one of which is directly relevant to development work, the rest focused on the UK.

The Malaria stories, however, are not listed alphabetically here, but are included among 70 other stories listed under “health”, most of which are focused on the UK. So, using this method requires a bit of attention and some lateral thinking.

A Data spreadsheet - The fourth method of finding data stories and visualizations, at least as far as I can see, is to go to the July 2011 post “All of our data journalism in one spreadsheet”, which I referred to at the top of this review. With its 1,407 posts, this has more than enough to keep anybody busy, and entertained, for weeks.

If it were up to date, it would be even more useful.

What is particularly useful about this database of stories, even in its current state, is that readers can sort the available stories

By title, alphabetically,

By whether they have downloadable spreadsheets with the original data (over 1,100 do have the original data),

By the number of times the story was referenced by users of Twitter in their “retweets” to others,

By the number of online comments.

The default order for the material runs from most recent (August 2011) to earliest (January 2009), but the list is not easily re-sorted back into that order.
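
Once downloaded, that spreadsheet can be re-sorted locally on any of the fields above. A minimal Python sketch, with invented post records standing in for the Guardian’s actual rows and column names:

```python
# Invented records standing in for rows of the "all of our data
# journalism" spreadsheet; titles, counts and field names are
# illustrative only.
posts = [
    {"title": "Aid spending by country", "has_data": True,  "retweets": 120, "comments": 45},
    {"title": "UK school funding",       "has_data": True,  "retweets": 310, "comments": 88},
    {"title": "Malaria deaths mapped",   "has_data": False, "retweets": 205, "comments": 30},
]

# Sort alphabetically by title.
by_title = sorted(posts, key=lambda p: p["title"].lower())

# Most-retweeted first, keeping only posts with downloadable data.
popular_with_data = sorted(
    (p for p in posts if p["has_data"]),
    key=lambda p: p["retweets"],
    reverse=True,
)

print([p["title"] for p in popular_with_data])
```

The same approach works for the comment counts, or any other column the spreadsheet provides.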

Links to the original visualizations

Where the Guardian does make use of other people’s graphics and visualizations, the Datablog provides links to the original sources.

Going to those original sources can sometimes provide more arresting and more interactive data visualizations than the Guardian itself presents.

Putting a critical eye to data – a challenge, not a limitation

The Guardian makes most of the data it works with available to its readers in spreadsheets. But as the Guardian itself notes about the underlying data in its World Government Data Store:

“Please keep in mind that the data provided in the Guardian's World Government Data API is aggregated from the source sites we are tracking and is provided on an "as is" basis. That means we do not check the accuracy or completeness of the data, nor are we able to grant you a licence to use it. If you wish to republish any of the data, you will need to check that such reuse is permitted by the source site, by following the link guidelines and usage terms and conditions on each site. You are solely responsible for what you publish.”

I have been described by one frank colleague as someone working with "a distinct spreadsheet dysfunction", and I do not think I am alone in this affliction among those who will be looking at the Guardian site. Even given the resources of the Guardian, there are occasionally problems – either with the way the information is explained or, perhaps, simply in how people like me understand (or fail to understand) the explanations.

Take, for example, the August 2, 2011 post on the U.S. debt ceiling. It makes some interesting points about who has been in power during the periods when the ceiling was raised, but actually looking at the underlying data, I found at least two things confusing about it, which may, however, have been clear to other readers:

There are references to what the net debt will be in 2013 and 2014, but it was unclear to me where those dates came from. I assume they came from budget projections, but the fact that the figures in the table for the debt ceiling – and GDP, for that matter – are substantially lower in 2014 than in 2011 is information that could, one would think, justify some explanation. Clicking on the link to download the full spreadsheet, moreover, takes us not to data on the debt ceiling but to a spreadsheet on who holds US debt – the subject of a different Datablog post, from November 22, 2011.

The second, and more minor, issue was a matter of proof-reading. One line in the post refers to the fact that “Ronald Reagan increased the debt ceiling by 23 times” – which would have put the debt ceiling somewhere near 40 trillion dollars when he left – when what I think was meant was that he increased it on 23 separate occasions (from roughly $2 trillion to $4 trillion).
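
The arithmetic behind that proof-reading point can be checked in a couple of lines (the dollar figures here are the rough approximations from the post, not exact Treasury numbers):

```python
# Rough approximations from the post, not exact Treasury figures.
start_ceiling = 2.0e12  # ~ $2 trillion debt ceiling at the start of the Reagan years
end_ceiling = 4.0e12    # ~ $4 trillion at the end

# "Increased BY 23 times" read literally is a multiplication...
literal_reading = start_ceiling * 23

# ...while "raised ON 23 occasions" is just a count of separate increases.
actual_increase = end_ceiling - start_ceiling

print(f"Literal reading: ${literal_reading / 1e12:.0f} trillion")
print(f"Actual change:   ${actual_increase / 1e12:.0f} trillion, over 23 occasions")
```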

This, of course, is the point of the challenge that the Guardian Datablog puts at the end of most of its posts:

“You can download the full data below. What can you do with it?”

Limitations – Ebook formatting

I started this post a month ago, intending to review the Kindle version of Simon Rogers’ book, Facts are Sacred: The Power of Data, which explores in more detail than the web pages how the Guardian gets and uses data. The book costs roughly $4, a small price given the contents, but while it is interesting, it is not nearly as useful as the Datablog and Data Store sites, which are, after all, free.

The limitations of the book are not derived from the information, but appear to be inherent to the Kindle, and perhaps other Ebook, formats. I have yet to meet anyone working in international development who has a Kindle – and only one who uses an iPad or Apple desktop. Of the roughly 90 subscribers to this blog, most of whom are in Asia, Africa or Latin America, only two appear to access it from a device using an Apple operating system. But Kindle and iPad versions were, in February 2012, the only two formats in which this Ebook was available.

I know people in North America who do use the Kindle for recreation, and some of them think highly of it for that purpose. Readers like me, who don’t have the Kindle device but want to read Facts are Sacred, can indeed, as I did, download free software from the Amazon website which permits us to read a Kindle publication on a desktop computer after we have purchased it. But I found that the formatting, the relative paucity of links in this version of the book, the number of dead links, and the difficulty of copying text when we want to reference material for research or reporting all make the Kindle format of very limited utility to those of us who want to use it for professional, rather than entertainment, purposes.

There are publishers, however, who make Ebooks available in PDF format. Although, as Simon Rogers writes, the PDF format may be “the worst format for data known to humankind”, my experience is that PDFs are superior – to the Kindle format, at least – for those reading reports for professional purposes. For Luddites like me, being able to see and work with recognisable pages and consistent formatting is comforting.

While the iPad version of this particular Ebook may be more interactive than the Kindle version, very few people working in the field on development projects can afford one, and most, in any case, use Windows-based computers.

Some publishers produce Ebooks that can be used by anyone with a computer. O’Reilly Books, for example, produces Ebooks in formats which can be viewed on the iPad, the Kindle, personal computers, smartphones or other devices – both in formats specific to those devices and in PDF. Readers can annotate, copy and print the material, and get updates to outdated books. The prices for any individual publication are significantly higher than most of those available for the Kindle on Amazon’s site – but we can spend $15 for an Ebook from O’Reilly and read it on our existing computer, or pay $150-$300 for a Kindle or iPad, then pay $3 for the book, and face the Kindle- and iPad-specific problems of trying to use it for reference.

Nevertheless, aside from the issues of layout and links, the Facts are Sacred book makes some interesting points about how the Guardian obtains and interprets data. If it comes out in a more usable format, it would be a useful companion to the Data Store and Datablog sites.

Given the Guardian’s commitment to open data, it would be helpful to see this book available online (even behind a paywall if necessary) or in another accessible format, so people with limited resources can read it. It is welcome news that the book will be available in paper, but it would be even better if a more accessible electronic version were produced.

The bottom line

Anyone who wants to see how creatively indicator data can be presented will enjoy the Guardian site, and could easily spend hours exploring what is available. But the process of transforming mundane data sets into dynamic interactive presentations requires the kind of resources – time, technical expertise and money – that most individuals and small organizations do not have.

In any case, as the Guardian sites make clear, visualizations are not the end of the story. Checking data for validity, reliability and context is as essential to journalists as it is to all of us as we try to make our reporting credible.

Greg Armstrong is a Results-Based Management specialist who focuses on the use of clear language in RBM training, and in the creation of usable planning, monitoring and reporting frameworks. For links to more Results-Based Management Handbooks and Guides, go to the RBM Training website.

About Me

Greg Armstrong brings what we know about how adults learn to helping international development workers use Results-Based Management in their work. If it is done right, it can be enjoyable, and productive, helping us explain our work to others.