Tagged: OpenPlatform

One of the things I try to impress on folk whenever I do a talk about web stats is how showing charts of mean (that is, simple averages) numbers of visitors or time on site is all but useless, because the actual distribution of values is not likely to be normal, and so the simple averages reported are, too all intents and purposes, not good for anything.

So when putting together a change treemap to summarise the changes in funding of quangos form government departments (Visualising Where the Money Goes: Westminster Quangos, Part 1), what I did originally was to visualise the change in funding for each quango within each department, rather than just displaying the following simpler, overall funding by government department treemap that I posted about previously:

The reason being? The overall change treemap might show increases or decreases in expenditure for a department as a whole, but it doesn’t reveal whether the funded quangos were all treated equally, or whether lots of small quangos received a cut whilst one big one received a huge rise in funding, for example:

So how did I create this treemap? The simplest way is just to create a query on the original spreadsheet that pulls in 4 columns – department, quango, and two expnditure columns (one for 2006 and one for 2007). A query built around this kernel, in fact:

The CSV feed can then be pulled into a Many Eyes Wikified and visulaised in a variety of ways that reveal both the overall expenditure of quangos funded from within each department, as well as the relative funding over 2006/2007 of each quango:

So for example, for the two biggest quangos by expenditure in Defence, one spent more over 2006/7, and the other spent less…

Using the same data page , we can create other visualisations within the Many Eyes Wikified environment that allow us to explore the data in a little more detail. So for example, here’s a bit more detail about the funding of quangos in the Department of Health. Parallel bands show that quangos spent equivalent amounts in 2006 and 2007, lines that diverge going from left to right show an increase in expenditure, and lines that converge going from left to right depict decreasing expenditure.

A scatter chart readily shows large changes in expnditure over the two years:

And some charts are just shiny ;-)

Compared with just trawling through the spreadsheet data, I think there is a lot to be said for using visualisations to identify out of the norm data points (e.g. using the scatterplot), or unpacking totals (as in the case of departmentally funded quango expenditure) in a quick and easy way as a starting point to exploring the data in a more systematic way, for example in order to ask journalistic questions (whatever they might be).

My tweetwatching has been a little sporadic of late, so I haven’t really been keeping up with the data sets that keep being posted to the Guardian Datablog, but today I had a quick skim through some of the recent uploads and had my eye caught by a post about funding of UK government quangos (Every quango in Britain [spreadsheet]).

What I’ll show in this post is how to construct a query on one of the quangos data sheets that can then be visualised as a change treemap, showing at a single glance how funding over all the quangos (arranged by sponsor department) has changed over the period 2006-2007.

The quangos spreadsheet is split into several different areas and at face value is quite hard to make sense of (what’s the difference in status (and provenance of the data) between the NHS and Health quangos, for example?

But I take nothing if not a pragmatic view about all this data stuff, and whilst there may be, err, issues with doing “proper” data journalism with this spreadsheet, I think we can still get value from just seeing what sorts of technology enhanced questions we might ask of this data, such as it is (as wll as identifying any problems with the data as it is presented), and as a result maybe identifying various issues with how to present and engage with this data were it to be republished again.

As ever, my doodles don’t properly acknowledge the provenance or source of the data, nor do I try to make any sense out of the data or look for any specific ‘meaning’ within it – I’m still at the stage of sharpening my technology pencil and seeing what sorts of marks it can make – but this is something I know I don’t do, and will be something I start to look at somewhen, honest!

So let’s make a start. To provide a bit of context, the questions I set out with when doodling through this post were:

1) is the data clean enough to run summarising queries on the data (that is, queries that summed totals for different departments)?
2) is the data clean enough to not break Many Eyes Wikified if i pass it to that visualisation tool via a CSV feed?

And a matter arising:

3) how do I write a query that specifies column headings (the headings in the spreadsheet leave something to be desired, at times….)?

The spreadsheet I chose to play with was the Westminster sheet, which you can find from here: UK Quangos [Guardian Datastore] (you’ll need to select the appropriate tab).

Just by looking at the data in the spreadsheet we notice a couple of things, things that suggest certain queries we might run on the data. (I probably need to unpack that phrase at some point (“run certain queries”) but the essence of it is this: if we treat the spreadsheet as a database, so sort of reprts can we generate from it; typically, in a database environment, reports are generated by running queries using some sort of database query language, which in the case of Google spreadsheets is the SQL like Google Visualisation API Query Language.)

So, the first thing I noticed are the two columns on the left – Government departments and presumably the quangos sponsored by those departments. And what these suggested to me were that I should be able to generate reports that summarise expenditure over all quangos in each department. (Whether or not this is interesting, I don’t know; but it’s something that we should be able to do easily enough, and it may spark off other questions in our mind).

The second thing I noticed was that lots of the data straddled two years (2006 and 2007)

And finally, gross expenditure seemed like a meaningful quantity and maybe least open to contention, so I decided to pick on that as the quantity to sharpen my pencil with:

To start playing with spreadsheets, I bookmarked it so that I could play with it in my Data Store explorer (note that I needed to specify the sheet number, where the first sheet is sheet 0, the second is sheet 1, and so on; so the Westminster sheet (the third sheet form the left in the spreadsheet) is sheet 2):

When we preview the column headings, (which the API assumes are in the top row, I think?), we get – not a lot…

If we scroll down in the data store explorer, we get at least the spreadsheet column labels:

Anyway, let’s try to run a query that summarises the overall gross expenditure for 2006 (column R) and 2007 (column S) for each department (column C):

The query is encoded as:select%20C%2Csum(R)%2Csum(S)%20group%20by%20C%20order%20by%20sum(R)%20desc

That is:select C,sum(R),sum(S) group by C order by sum(R) desc

So we select three columns, and for each column, group the rows according to department (column C), display the summed value over those grouped rows for columns R and S, and presenting them in descending (desc) order by column sum(R):

If we click on the HTML preview link, we can view the table in its own page:

If you preview the CSV, you’ll notice there’s a problem with it though:

There are rogue commas everywhere, appearing within the ‘numerical’ data, and this file is supposed to use commas to separate out different elements. To get proper numbers out, we need to set their format, which means adding something to the end of the URI:

format sum(R) ‘#’,sum(S) ‘#’

(Not that you do need to encode the #s by hand, as %23)

That is, present sum(R) and sum(S) in a proper numerical format:

So there we have summary tables showing the expenditure for each government department. Many Eyes Wikified isn’t letting m import that data directly via CSV at the moment, but it’s easy enough to download the CSV and copy the data into a Many Eyes Wikified data page:

That is:

(Casting an eye over the CSV data, there are also a couple of glitches in it that upset the grouping – eg “Home” and “Home ” (trailing space) are treated differently.)

We can now use this page to create some visualisations – so put some placeholders in place in a visualisation page:

One of the potentially most informative summary views for this data is a change treemap, that shows the relative gross expenditure for each department, along with whether it has increased or decreased between 2006 and 2007:

Last week, Steph Gray announced yet another innovative way of trying to engage people in public consultations in his blog post Your starter for ten. The piece describes a scheme in which a series of pub quiz style “killer facts” are pulled out of a current consultation document on consumer rights and credit and then represented in a quiz format along with why you should care/what the consultation is seeking to do to address the issues raised by each quiz question. (You can find the quiz from a link on here: Government action to secure a better deal for consumers.)

This idea, of microchunking particular elements of a consultation and then trying to use these microchunks to draw people into commenting on a consultation document, is one that Joss Winn and I have casually explored in the context of WriteToReply. In that case, we discussed whether or not we should pull out intriguing facts or potentially contentious questions that we could then tweet, along with a link to the appropriate part of the consultation document, in order to entice people into commenting, either directly on the WriteToReply site, or by remote commenting (that is, posting a blog comment or tweet that links back to a particular paragraph on the WriteToReply site site that we can then track via a Trackback).

(As part of this, we imagined creating a list of ‘nuggets’ pulled from consultation docs as we imported them into WordPress; it strikes me now that if we did have such a list, we could set up a twitter account for each consultation that could be run on a ‘daily feeds’ like basis – whenever anyone subscribes, they start to receive tweets @’d to them, according to a personal schedule starting at the moment they follow the consultation, as well as more general broadcast tweets?)

One thing to note here is that rathr than linking to the actual paragraph that contains the question, which is what we’d normally do, these tweets link to paragraphs that preview, and provide the context for, the questions. So if you follow the link, you are lead into the body of the consultation document, and if you read on you then come to the question included in the tweet. That is, the tweet provides the question that sets the contest, the link leads through to the part of the consultation that provides the context for the question, and then to the question as it appears in the consultation doc.

Also on Twitter, Joss and I fell into a conversation with Steph and Richard Stirling about the different audiences for consultation docs and what the appropriate means of publication are for those different audiences. So for example, Steph suggested “Consultations have multiple audiences. Suspect downloadable PDFs actually not bad for policy folk. But for public?” [ref], which was backed up by Richard: “I agree with @lesteph’s point. As a policy person I often want to read the whole doc – not sections. PDF works.” [ref].

However, if the aim is to reach outside the policy wonks and the committed lobbiests/interest group members, then I suspect we need smaller ‘headline’ chunks, or atomic parts of the consultation document, to pull people in to the consultation. (Also, we may learn something form the journalists here, and the way they construct stories to lead people in, or at least, give them some of the facts – that is, facts they can misquote in the pub later! – up front.)

There are dangers with the headline approach, of course, as the ‘simplistic’ tweeted questions shown above suggest… At the simplest reading, they just solicit a trivial yes/no answer, rather than an informed comment. But bear this in mind too – those questions were taken from the consultation document itself.

A further thing that’s interesting to note is how the consultation document is actually constructed. The ‘argument’, such as it, and the issues that the consultation wishes to be taken into account, are used to preface the actual questions (see the sections on Potential Sources of Top-Up Funding or Protecting the BBC’s Funding for a couple of examples).

That is, some issues a presented, and then the question is asked. But how likely is this to work as an engagement strategy? A cold start conversational strategy would probably be more likely to start with a question, followed by a discussion (or argument) and an agreement to disagree.

and also by using the questions to lead in to some of the discussion that actually appears before the questions in the original consultation document:

We’ve also started looking at pulling related news stories in to the dashboard, in the first instance from the Guardian using the Guardian OpenPlatform API, to try and embed the consultation in a wider context:

There is an issue of circularity here, of course – the news reports presented to date stem largely from responses to the original consultation call, so rather than setting the consultation in context, you could argue they are just responses to it.

Cf. also the approach taken particularly on BBC sites where full articles on a government documents are often backed up with a link to the original document:

But we have to start somewhere, and we are, after all, making this stuff up as we go along. If nothing else, we are exploring how to re-balance the presentation of the consultation doc and associated news stories compared to the mode of presentation used by the BBC et al.

And finally (and slightly off topic!), note that we’re also using WordPress feed to pull in both the content of the report and the comments from the WriteToReply republication of the original consultation document:

However, whilst we can pull the content of the report into the dashboard via an RSS feed, the paragraph level links and links and comment links are not passed though the RSS:

(I suspect this is because the linking is managed by the CommentPress theme? Joss – maybe we need to look at adding paragraph and “comment here” links to the RSS content too?)

When the Guardian launched their OpenPlatform DataStore, a collection of public data, curated by Guardian folk, hosted on Google Spreadsheets, it raised the question as to whether this initiative would influence the attitude of the Office of National Statistics, and in particular the way they publish their results (e.g. Guardian Data Store: threat to ONS or its saviour?).

In the three sexy skills of data geeks, Michael Driscoll reinterprets Google’s Chief Economist’s prediction that “the sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it” with his belief that “the folks to whom Hal Varian i.e. [Google’s Chief Economist] is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics [studying], data munging [suffering] and data visualization [storytelling]”

I’ve already suggested that what I’d quite like to see is plug’n’play public data that’s easy for people to play with in a variety of ways, and publishing it via Google Spreadsheets certainly lowers quite a few barriers to entry from a technical perspective which can make life easier for statisticians and the visualisers, and reduce the need for the data mungers, the poor folks who go through “the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy” as well as providing access to data where it is difficult to access: “related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores”.

But if you don’t take care of the data you’re publishing, the even though there are friendly APIs to the data it doesn’t necessarily follow that the data will be useful.

Here’s my thought: open data needs a new breed of data gardeners – not necessarily civil servants, but people who know data, what it means and how to use it, and have a role like the editors of Wikipedia or the mods of a busy forum in keeping it clean and useful for the rest of us. … Support them with some data groundsmen with heavy-lifting tools and technical skills to organise, format, publish and protect large datasets.

So with all that in mind, is the Guardian DataStore adding value to the data in the data store in an accessibility sense by reducing the need for data mungers to have to process the data so that it can be used in a plug’n’play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?

As a way in to this question, let’s look at the various HE datasets. The Guardian has published several of these:

Before we look at the data, though, let’s look at the URIs to see if the architecture of the site makes it easy to discover potentially related datasets. (Finding data is another of the skill related to the black arts of the data mungers, I think?!;-)

This latter issue, poor data definition, is evident in the naming of the HEI institutions above: I can’t simply import the overall tables and dropout tables into DabbleDB and let it magically create a combined table based on common (i.e. canonical) HEI names (using the approach described in Mash/Combining Data from Three Separate Sources Using Dabble DB), for example) because the HEIs don’t have common names.

So what does Redmond have to say about this (p.55)?

– Find and fix errors
– Prevent them at their source [in this case, the error is inconsistency and could have been prevented by using a common HEI naming scheme, OR providing another unique identifier that could act as a key across multiple data tables; but name is easier – because name is what people are likely to search by…).

(See also Redmond’s “Hierarchy of Data and Information Needs”, (p. 58), which identifies the need for consistency across sources.)

Note that we have a problem though – the datastore curators can’t change the names in the current spreadsheets, because people may already be using them and keying on the current name format. We shouldn’t create another spreadsheet containing the same data because that causes duplication/redundancy? So what would be the best approach? Answers on the back of a postcard to, err, the Guardian datastore, I guess?!;-)

So is it the Guardian’s job to be curating this data, or tending it as one of Steph’s data gardeners/groundsmen might? If they want it to be a serious resource, then I would say so. But if it’s just a toy? Well, who cares…?

PS Just in passing, what other value might the DataStore add to spreadsheets to make them more amenable to “mashups”? For data like the university data, providing geo-data might be interesting (even at the crude level of just providing a single geographical co-ordinate for the central location of the institution). If I could easily get geo-data for the HEIs, and combine it with the satisfaction tables or dropout rates, it would be trivial to generate map based views of the data.

PPS one other gripe I have with the Guardian datablog, where many of the datastore data sets are announced, is that the links are misleading:

Now call me naive, but I’d expect those DATA links to point to spreadsheets, as indeed the first two do, but the third points to another blog post and so I’ve lost trust in being able to use those DATA links (e.g. in a screenscraper) as a direct pointer to a spreadsheet.

So now suppose I want to pull in data from the Guardian spreadsheets. I can use the name as a key and construct a URI that will query the Guardian spreadsheets by MP name. An easy way to gt an example query is to use the Guardian Datastore Explorer:

That is, we construct the URI around the name contained in the cell a couple of columns to the left, (data that was itself pulled in from a Tory party spreadsheet).

(Note that I’m generating an HTML preview, but you should be able to grab CSV too: just set tqx=out:csv).

Drag the cell down to generate URIs for all the Shadow Cabinet:

We can now import data into the spreadsheet using that constructed URI:

Remember, I generated an HTML preview table URI. (If you generated a CSV URI, use =importdata(C2). Note I was using CSV originally, but then it seemed to stop working, so I switched to HTML).

Here’s what you get when you copy that (relative) formula for each MP:

Okay, so there’s an issue here – we’re only getting very other MP because the table headers are being pulled in (the same happens with CSV – I couldn’t find a way to get the importdata formula to suppress the header?). A clunky workaround you be to create another set of columns that pull in the other alternate rows (i.e. start pulling in data from row 3 rather than row 2).

You’ve also got the header crap everywhere, but I’m guessing we could probably write a query language call that would pull out the data from this spreadsheet and ignore the rows containing the column headings.

Anyway, so what?

So – I can create a data square in a Google Spreadsheet that pulls in and essentially combines data from multiple spreadsheet data source that I know and trust, as long as there is a common key query term (name in this case).

Prompted by a couple of comments from @Josswinn to be more transparent (?!;-), here’s a glimpse into how I set about learning how to do something.

It potentially won’t make a lot of sense, but it’s typical of the sort of process I go through when I hack something together…

So, there’s usually an initiator:

Then there’s a look to see if there’s anything to play with that is on my current list of things I thing I’d like to, (or need to;-), know more about:

Hmm, ok… Google spreadsheets. I’ve just learned about how to write queries against Google spreadsheets using the visualisation API query language, so can I push that another step forward…? What don’t I know how to do that could be useful, and that I could try to demo in an app using this resource?

How about this: a web page that lets me pull the result out for a searched for by name university.

Hmm, what would that mean? A demo around a single subject, maybe, or should the user be able to specify the subject? The subject areas are listed, but how do get those in to a form? Copy the list of tab names from the spreadsheet? Hmm… Or how about entering the name of a single university and displaying the results for that HEI in each of the categories. That would also require me to find out how many sheets there were in the spreadsheet, and then interrogate each one…

Okay, so it’d be nice to be able to search for the results of a given university in a given subject area, or maybe even compare the results of two universities in a given subject area?

So to do that, do I need to learn how to do something new? If not, there’s no point, it’s just makework.

Well, I don’t know how to grab a list of worksheet names from a Google spreadsheet, so that’d be something I could learn… So how to do that?

Ah: “The worksheets metafeed lists all the worksheets within the spreadsheet identified by the specified key, along with the URIs for list and cells feeds for each worksheet”

I have no idea what the “list and cells feeds” means, but I’m not interested in that; “lists all the worksheets within the spreadsheet identified by the specified key” is what I want. Okay, so where’s a URL pattern I can crib?

Okay, that works… No obvious way of getting the gid of the worksheet number though, unless maybe I count the items and number each one…? The order of worksheets in the feed looks to be the sheet order, so I just need to count them 0,1,2 etc from the top of the list to gd the worksheet gid. Ah, there could be an opportunity here to try out the YQL Execute in a pipe trick? After all, the demo for that was an indexer for feed items, and because the API is chucking out RSS I need to use something like a pipe anyway to get a JSON version I can pull into my web page.

Hmmm, what else is there on the docs page? “alt, start-index, max-results Supported in all feed types. ” I wonder? Does alt stand for alternative formats maybe? Let’s try adding ?&alt=json to the URL – it may work, or it may relate to something completely other…. [Success] heh heh :-) Okay, so that means I don’t need the pipe?

What else – anything that could be useful in the future? Hmm, seems like the Spreadsheets API actually supports queries too? So e.g. I can run a query to see if there is a sheet that contains “geo” maybe?

Okay – lots of other query stuff there; remember that for another day…

So: to recap, the above process took maybe 10-15 mins and went from:

– initiator: see a tweet;
– follow-up: look at the resource;
– question: is there something I could do with that data that I don’t know how to do?
– question refined: how about I pull out a list of the worksheets from the spreadsheet, and use that in e.g. a drop down box so students can choose a subject area from a list, then search for one, or compare two, HEIs in that subject area. I don’t know how to get the list, and I’m not sure about the best way of comparing two items, so I’ll probably learn something useful.
– solution finding: check out the Google spreadsheets API documentation; (If that had failed, I’d have done a blogsearch along the lines of ‘feed list worksheets google spreadsheet’
– plan: err, okay, the plan is a form that pulls a list of worksheets from the HEI spreadsheet via JSON, indexes each one to give me the worksheet gid number (this is a possibly flakey step? Could I index the spreadsheet by name?) then builds a query on that worksheet using an input from one or more text boxes containing the name of HEIs (or maybe a single text box with comma separated HEI names?)

Normally I’d have then spent up to an hour messing around with this (it is (working) lunchtime i.e. playtime after all), but instead I spent forty five mins writing this blog post… which means there is no demo…:-(

So I had another bit of a tinker, and came up with some code that’s breaking all over the place, but I think there’s enough of a vision there to have something to say, so I’ll say it…

How’s about a generic query’n’viz tool for the Guardian datastore? My first (and maybe last) attempt at a back of an envelope, sometimes it works, sometimes it doesn’t, bare bones rapid prototype of just such a thing can be found here.

So referring to the image below, if you select a spreadsheet from the drop down list and click preview, you should get a preview of the column headings from that spreadsheet:

(The new link is to an original Guardian blog post announcing or describing the data.)

The list items are pulled in from a tag on my delicious account, which actually bookmarks the original data blog posts. The URI for the spreadsheet is added to the end of the bookmark description, and keyed with a –:

ISSUE 1: Sometimes the spreadsheet doesn’t load… I don’t know if this is down to something I’m (not) doing or not (if you’ve seen this sort of error and know a cause/fix, please post a comment below).

I’ve found if you just keep canceling the alert and clicking “Preview” the file loads in the end…

Another new feature is the ability to preview results using various chart types, rather than just use a data table:

(Oh yes – the “bookmark” link should also allow you to share the current view with other people. At least, it shares the spreadsheet ID and the query, but not the view type…)

I haven’t implemented chart labeling, or the ability to set what values are used for what bit of the chart, so chart compomnent default rules apply. By juggling the queries (including changing the order of columns that appear in the various text boxes), you can sometimes get a reasonable chart out.

Of course, you can always just grab the CSV URL and then visualise the data in something like Many Eyes Wikified.

The chart components I used are all taken from the Google Visualisation API, so they play nicely with the Google data source representation that holds the data values.

So, that’s where it’s at, and that’s probably me done with it now… (I think I can see what’s possible so my fun is done…) And if you haven’t got an inkling of what it is I think I can see, it’s this:

A front end to the Guardian data store that lets readers:
– select a data set from the datastore (and maybe get a chance to view the original story from the datablog; I guess this could be pulled in from the Guardian OpenPlatform API?)
– write queries on that dataset to generate particular views of the data;
– generate CSV and HTML preview view URLs for a particular query, so the data can be shared with other people (turning different views on subsets of the data into social objects);
– generate quick visualisation previews of different views of the data.

Nice to haves? Maybe links to stories that also mention the data, again via the OpenPlatform API? A collection of different bookmarks/views that use the same spreadsheet, so readers can share their own views of the data (the sort of social thing that Many Eyes Wikified offers). An opportunity to accept comments on a dataset? etc etc

All told, I reckon it’s taken less than 20 hours of solo effort (except for a bit of 3rd party bug spotting ;-), plus time to write the blog posts, to get this far (but the state of the code shows that: it’s truly scrappy). A fair amount of that time was spent learning how to do stuff and looking at exemplar code on Google AJAX APIs Code Playground. Of course, there are bugs’n’issues all over the place, but as people bring them to my attention, I tend to see if there’s a quick fix…

PS (I think) I’ve just noticed a Google data source wrapper for Pachube (Google Visualization API for Pachube history), which means that as well as pulling in Guardian datastore content from Google spreadsheets (as well as other publishers’ content on Google spreadsheets), this ‘interface’ could also be applied to Pachube data. (If you know of anyone else who exposes the Google visualisation/data source API, please post a link below.)