Cui bono? The problem with opening up data

There’s a clear argument for opening up public sector information for reuse. It increases transparency. It’s the raw material for new kinds of services with public or commercial value. It improves measurability and makes possible new kinds of analysis. Therefore, the public sector should open up more of its data, in reusable formats.

Building a catalogue of data held by public sector organisations (like the great one set up in Washington, DC and now the Federal data.gov) is a logical starting point. Audit what’s held in departments, who’s responsible for it, and publish the list (with links to the relevant datasets) for potential reusers to come and browse. Or even go a step further as they’re doing at Kent County Council with their Pic-and-Mix pilot, and develop a mashup hub too, for end users themselves to develop and discuss applications which use that data.

Richard is now asking for feedback on the formats and structure of the UK’s data catalogue – do go and give him your thoughts. But I’m still troubled by some more fundamental problems than whether we publish the data in JSON or RSS:

Which data? For example, I suspect a simple, definitive postcoded list of UK higher education institutions would be useful to a fair number of people developing map-based mashups – though I’m not sure a civil servant would identify just how useful that kind of thing could be. I wonder whether a mechanism a bit like a souped-up, less forlorn version of OPSI’s ‘data unlocking service’ might provide a forum for potential re-users to make ‘bids’ for useful data – even if they don’t know where it sits or what form it takes – and develop a community which can assess, prioritise and refine those data specifications.

Who decides whether to publish? Proactively publishing data almost inevitably increases an organisation’s short-term potential exposure to criticism (even if it reduces it in the long term). It invariably generates tedious work for which the perceived ‘market’ is tiny. To play devil’s advocate: from a civil servant’s perspective, what makes the ‘open data people’ any different from the cranks who’ve always made trouble for bureaucrats by asking vexatious questions? There’s no big queue of citizens asking for data right now, and only a hypothetical end user audience for hypothetical tools based upon it. Ask an IT manager, a press officer and a policy official whether to publish any given dataset and you’re likely to get three radically different answers. We need some pretty clear principles to determine what gets published, to prevent our data catalogues being reduced to the blandest lowest common denominator.

Who benefits? The civil service isn’t the machine it’s sometimes portrayed as: ours is a surprisingly small, somewhat stretched, government of humans for whom opening up data is not – to put it mildly – a top priority, even where the data itself are simple and uncontroversial. We can tell these people to do it, but until we can show them where the benefits lie – not just in the social value, but the benefit for their organisation and for them personally – we’re unlikely to get buy-in on a large scale.

Who pays? Cleaning up data for publication, documenting it, checking it for errors or personal details, reformatting it, uploading it, answering queries about it – there’s a lot of work involved in open data. It’s not in established job descriptions – so we’re likely to need more people to do this work, if it’s to happen on a large scale. Now, as taxpayers, we might decide that’s a cost worth paying. But for how many datasets? And at what maximum cost?

For how long? As any knowledge manager will tell you, information has a life cycle. Publish it now, and in six months’ or six years’ time, bit rot may have rendered it useless. Who is going to be responsible for maintaining the data when published, and what liability should public bodies accept for its misuse or inaccuracy when used by third parties? If the hospital Mashup A told you was at map co-ordinates X,Y turns out not to be, who are you going to be able to shout at about it?

Here’s my thought: open data needs a new breed of data gardeners – not necessarily civil servants, but people who know data, what it means and how to use it, and have a role like the editors of Wikipedia or the mods of a busy forum in keeping it clean and useful for the rest of us. Encourage three or four independent people passionate about, say, transport or secondary education, who know and respect the system enough to know how to extract useful data, without rattling too roughly the cages of the people who will be asked to provide it. They’ll know when the data changes, or what a reasonable request is, or where something can be found because… they just know that area like the back of their hands. Support them with some data groundsmen with heavy-lifting tools and technical skills to organise, format, publish and protect large datasets. And then point the digital mentors at the data garden, to get communities to come and enjoy the flowers in ways that enrich their lives.

Personally, I passionately want to see open data work in the UK. But as with so much on the web, I think the primary challenges will be sociological, not technical.

21 comments on “Cui bono? The problem with opening up data”

Brilliant, couldn’t have put it better myself. Like all online innovation, implementing this in government is going to be well more difficult than outsiders think. You can easily see some central diktat to already overworked web or IT people, who know nothing about this area, mandating them to *do this* without any consideration of how and why. Wouldn’t be the first time…

I’m not sure you are necessarily correct about the cleanup requirements associated with publishing data.

I understand that the data may not be 100%, but most of us who want the datasets are quite happy to work with – and around – the imperfections. I certainly don’t want our civil servants to be tasked with cleaning-up the data before publishing it. I’m pretty sure that will be used as an excuse to delay it.

Couldn’t put it better myself Steph and this is certainly what we have found through developing our Pic & Mix project down in Kent.

All the questions you mention are interlinked – Which data? Who decides whether to publish? Who benefits? Who pays? For how long?

Given the infinite sets of data across a council or government dept, let alone across all public sector, you would need a diktat to make sure that all publicly available data was released in a re-usable way.

Let’s take a very specific example of the benefits of opening up data to be mashed together. You’ve got a family moving to a new area – father’s been relocated, mother needs to find new job, their 16 yr old kid wants to start an apprenticeship and the three yr old needs some kind of childcare arrangements – oh yeah and throw in good public transport connections and after school activities for the teenager.

There is a lot of work going on around making services adapted to the needs of people and place rather than aligned to departmental processes – just look at Every Child Matters, CAA and MAAs.

If we start opening up data in line with these areas, then wouldn’t service managers (and data owners!) see it as part of the toolkit for making it easier to achieve the specific outcomes they have?

With Edinburgh we are looking at involving the citizen in the design. There might be an “app” for moving into a new area, searching for something, etc.

The intention is to build an AnywhereCouncil app set, since the act of moving is the same from John o’ Groats to Penzance.

This will be a possible UK first, as I don’t remember any government initiative asking the citizen to design their services before. That is why Job Centre Pro Plus may actually work!

We have public servants, but they may not often serve the name in their title.

As far as mashing the data goes, people are planning for it all to be released in “as-is” state, together with early years school and nursery information, including e.g. post-code. Of course, the Ordnance Survey continues to take a 19th-century approach to this, unlike Denmark or other sensible countries where collaboration is actually encouraged.

Then the coders and technical folk, together with the citizens, council officials and members, can get on with designing what they need for the people who live here.

Alex, at Kent we’ve prioritised getting the citizens involved in the design as well, more than building a localised version of data.gov.

What people create for themselves can benefit others and it also harnesses their creativity and energy – being able to “build stuff that matters” (to paraphrase Tim O’Reilly). It also can provide a great way for us to find out what different services (don’t just mean our ones!) our residents try and connect to for any given need they have (customer insight!).

We now get people coming to us saying “we just built a microsite with the other local partners to engage on x issue, but if we had worked with you, we could have just used your mashup hub”.

I know the PoI taskforce produced some really good recommendations for local government, but those are still organisation focused – we need PoI to talk to the people setting the “place shaping” agenda…mind u if Kent & Edinburgh are just getting on with it anyway, maybe that’s not so important?

Government should do much as the Guardian has done with the expenses: publish data in its raw format to allow people to do what the hell they like with it. Mash it up, stick it on their fridge or move house to a postcode served by better schools. Government can itself be a “customer” of that data, doing whatever analysis it chooses, but by putting the data out there, the onus is on individuals and the private sector to add value by turning it into information, from which individuals can garner knowledge.

“I understand that the data may not be 100%, but most of us who want the datasets are quite happy to work with – and around – the imperfections. I certainly don’t want our civil servants to be tasked with cleaning-up the data before publishing it. I’m pretty sure that will be used as an excuse to delay it.”

I know what you’re saying, and Jeremy’s said much the same too over on Richard’s post. But ‘clean up’ could just involve important but slightly laborious tasks like confirming a dataset doesn’t reveal citizens’ personal data (e.g. names + postcodes, NI numbers or whatever), combining three messy spreadsheets into one, or putting in titles and basic notes so someone coming to the data ‘cold’ understands how it’s structured.

So while I agree that there’s a risk it will be an excuse to delay and needlessly polish, if civil servants start just publishing the content of their hard drives, the ICO and others will justifiably get upset.
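A minimal sketch of the kind of pre-publication check described above, assuming the dataset is a CSV file. The two regular expressions are rough, hypothetical approximations of NI number and UK postcode formats, not official validators, and the function name `flag_personal_data` is invented for illustration. It only flags cells for a human to review; it does not decide what is safe to publish.

```python
import csv
import io
import re

# Assumed patterns, for illustration only: loose approximations of the formats.
NI_NUMBER = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b", re.IGNORECASE)
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE)


def flag_personal_data(csv_text):
    """Return (row_number, column, kind) for each cell that looks like personal data."""
    findings = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so the first data row is row 2.
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # skip cells from short or over-long rows
            if NI_NUMBER.search(value):
                findings.append((row_number, column, "ni_number"))
            if POSTCODE.search(value):
                findings.append((row_number, column, "postcode"))
    return findings


# Made-up sample data, not taken from any real dataset.
sample = "name,reference,area\nA. Person,AB123456C,SW1A 2AA\nB. Person,n/a,unknown\n"
for finding in flag_personal_data(sample):
    print(finding)
```

A postcode on its own is rarely personal, which is why the output is a list for someone to look through rather than an automatic pass or fail.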

@Steph
I agree entirely that personal data such as NI numbers should not be published. But I think that type of issue should never be called ‘clean up’. The 4 precursors to publication you seem to mention above are:
*Cleaning up data for publication
*documenting it
*checking it for errors or personal details
*reformatting it
Of those there is only ½ that I would allow, i.e. checking it for … personal details

The rest can wait. You may even find that Joe Public does much of the rest for you, post publication.

And the personal details issue needs to be addressed much further back up the chain. What good reason is there for personal data to be in most civil service datasets? Prevention is much better than cure.


Following on from your question, ‘what do I know’, I think it would be useful for readers of your blog (i.e. me, at least) if you did two things:

1) Offer more bio info about yourself. Your experience that brought you to do and write about this stuff. When you say, ‘I know how to copy and paste URLs into things…’, I read that as ‘you, the reader, should know how to copy and paste URLs into things…’. Sometimes, I read your posts and think, ‘that’s all very well, but to make this work for me, I’m going to have to learn SQL, Regular Expressions and understand the principles of RESTful architecture’. Knowing a bit more about where you are coming from in terms of skills, would provide useful context, I think.

2) So, with this in mind, (and as I alluded to in a tweet over the weekend), alongside the ‘how to mashup’ posts, I think it would be both interesting and really useful, to have an OUseful skills curriculum. A page or category of posts, that lays out the basic set of skills that readers should be working on if we’re interested in contributing to the type of (good) work you’re doing. Links to quality resources and tutorials elsewhere would really complement your (educational) work, I think. Many of your posts are well-structured tutorials but I feel like they’re written for people that have been with you from the start and there’s nowhere for the new reader to get up to speed with both who is writing and the skills you’re assuming readers should work on in parallel to following your tutorials.

If the data is in the public domain (e.g. PDDL, CC0) then we’ll host it for free. While the core data is all managed as RDF, there are other ways to access it, e.g. using a configurable search engine that supports facets, so that non-semweb people can still get data out as RSS. The Platform also supports an XSLT service, so transforming a SPARQL query result into KML or CSV is easy to do with a bit of URL pipe-lining.

Front that with a website for browsing datasets, and you’ve got the makings of exactly the kind of infrastructure you want to see.
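To make the “SPARQL query result into CSV” step concrete, here is a rough sketch. It is not the Platform’s XSLT service: it swaps in plain Python, and it assumes the result arrives in the W3C SPARQL 1.1 Query Results JSON Format (a `head.vars` list plus `results.bindings`). The function name `sparql_json_to_csv` and the example data are hypothetical.

```python
import csv
import io


def sparql_json_to_csv(results):
    """Flatten a SPARQL JSON results document into CSV text, one row per solution."""
    variables = results["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(variables)
    for binding in results["results"]["bindings"]:
        # An unbound variable is simply absent from a binding, so default to "".
        writer.writerow([binding.get(v, {}).get("value", "") for v in variables])
    return out.getvalue()


# Made-up result document, shaped like the standard JSON results format.
example = {
    "head": {"vars": ["name", "postcode"]},
    "results": {"bindings": [
        {"name": {"type": "literal", "value": "Example College"},
         "postcode": {"type": "literal", "value": "AB1 2CD"}},
        {"name": {"type": "literal", "value": "Another Institute"}},
    ]},
}

print(sparql_json_to_csv(example))
```

The second row deliberately lacks a postcode, to show that a missing binding becomes an empty cell rather than an error.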

I agree Tony. I have been a little disappointed with the data they are putting up. I’d hoped to spend some lunchtimes messing around with their datasets, but have abandoned any hope of that. Why? Because, as you say, the data is so randomly named or collated. It’s simply not possible to quickly pull 2 tables from their tables and compare them. One would have to do a lot of data cleansing just to be able to “play” with it. And do I have the time to do that cleansing? No.

I also concur that their inconsistent link naming is infuriating. If you say “DATA:” then don’t link to a different blog post!