What

Open Data Maker Nights are informal events focused on “making” with open data – whether that’s creating apps or insights. They aren’t a general meetup – if you come, expect to get pulled into actually building something, though we won’t force you!

Who

The events usually have short introductory talks about specific projects and suggestions for things to work on – it’s absolutely fine to turn up knowing nothing about data or openness or tech, as there’ll be an activity for you to help with and someone to guide you in contributing!

4. Tooling

5. Feedback on standards

There’s been a lot of valuable feedback on the Data Package and JSON Table Schema standards, including some quite major suggestions (e.g. a substantial change to JSON Table Schema to align it more closely with JSON Schema – thanks to jpmckinney).
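For anyone who hasn’t seen these standards, here’s a quick sketch (in Python, since that’s what I’d reach for) of what a minimal Data Package descriptor with an inline JSON Table Schema looks like – the dataset name and fields are invented for illustration, and the exact keywords were still in flux at the time of writing:

```python
import json

# A minimal Data Package descriptor with an inline JSON Table Schema.
# The name, path and fields below are illustrative, not a real dataset.
descriptor = {
    "name": "uk-25k-spending",
    "resources": [{
        "path": "spending.csv",
        "schema": {  # the JSON Table Schema describing the CSV resource
            "fields": [
                {"name": "date", "type": "date"},
                {"name": "supplier", "type": "string"},
                {"name": "amount", "type": "number"},
            ]
        },
    }],
}

# Write it next to the data as datapackage.json.
with open("datapackage.json", "w") as f:
    json.dump(descriptor, f, indent=2)
```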

Next steps

There’s plenty more coming up soon in terms of data, the site and the tools.

PublicBodies.org is a database and website of “Public Bodies” – that is Government-run or controlled organizations (which may or may not have distinct corporate existence). Examples would include government ministries or departments, state-run organizations such as libraries, police and fire departments and more.

We run into public bodies all the time in projects like OpenSpending (either as spenders or recipients). Back in 2011 as part of the “Organizations” data workshop at OGD Camp 2011, Labs member Friedrich Lindenberg scraped together a first database and site of “public bodies” from various sources (primarily FoI sites like WhatDoTheyKnow, FragDenStaat and AskTheEU).

The simplicity of CSV for data plus simple templating to flat files is very attractive. There are some drawbacks, such as changes to the primary template resulting in a full rebuild and upload of ~6k files, so, especially as the data grows, we may want to look into something a bit nicer; for the time being this works well.
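To make the approach concrete, here’s a minimal sketch of the CSV-plus-templates idea in Python – the file names and columns (id, title, jurisdiction) are illustrative assumptions, not the actual PublicBodies.org build code:

```python
import csv
import os
from string import Template  # stdlib templating keeps the sketch dependency-free

# Hypothetical page template; the real site has its own layout.
PAGE = Template("""<html>
  <head><title>$title</title></head>
  <body><h1>$title</h1><p>Jurisdiction: $jurisdiction</p></body>
</html>
""")

os.makedirs("site", exist_ok=True)
with open("bodies.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # One flat HTML file per public body, named by its id.
        out = os.path.join("site", row["id"] + ".html")
        with open(out, "w", encoding="utf-8") as page:
            page.write(PAGE.substitute(title=row["title"],
                                       jurisdiction=row["jurisdiction"]))
```

Note how any change to PAGE means re-running the loop over every row – exactly the full-rebuild drawback mentioned above.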

Next Steps

There’s plenty that could be improved, e.g.:

More data - other jurisdictions (we currently only cover the EU, UK and Germany) plus descriptions for the bodies (this could be a nice crowdcrafting app)

I’m playing around with some large(ish) CSV files as part of an OpenSpending-related data investigation looking at UK government spending last year – example question: which companies were the top 10 recipients of government money? (More details can be found in this issue on OpenSpending’s things-to-do repo.)

The dataset I’m working with is the consolidated spending (over £25k) by all UK government departments. Thanks to the efforts of OpenSpending folks (and specifically Friedrich Lindenberg), this data is already nicely ETL’d from thousands of individual CSV (and XLS) files into one big 3.7 GB file (see below for links and details).

My question is: what’s the best way to do quick and dirty analysis on this?

Examples of the kinds of options I was considering were:

Simple scripting (python, perl etc)

PostgreSQL - load, build indexes and then sum, avg etc

Elastic MapReduce (AWS Hadoop)

Google BigQuery

I’d love to hear what folks think and whether there are tools or approaches they’d specifically recommend.
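For a sense of what the simple-scripting option might look like, here’s a minimal Python sketch answering the top-10 question by streaming the file – note the column names (“supplier”, “amount”) are hypothetical, since the real file’s headers vary:

```python
import csv
from collections import Counter

totals = Counter()

# Stream the file row by row so the ~3.7 GB CSV never has to fit in memory.
with open("ukgov-25k-spending.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # "supplier" and "amount" are assumed column names for illustration.
        try:
            totals[row["supplier"]] += float(row["amount"].replace(",", ""))
        except (KeyError, ValueError):
            continue  # skip rows whose amount doesn't parse

for supplier, total in totals.most_common(10):
    print(f"{supplier}\t{total:,.2f}")
```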

I’ve been working to get Greater London Authority spending data cleaned up and
into OpenSpending. Primary motivation comes from this question:

Which companies got paid the most (and for doing what)? (see this issue for
more)

I wanted to share where I’m up to and some of the experience so far, as I think these can inform our wider efforts - and illustrate the challenges of just getting and cleaning up data. Note that the code and README for this ongoing work are in a repo on GitHub: https://github.com/rgrp/dataset-gla

Data Quality Issues

There are 61 CSV files as of March 2013 (a list can be found in scrape.json).

Unfortunately the “format” varies substantially across files (even though they are all CSV!), which makes using this data a real pain. Some examples:

the number of fields and their names vary across files (e.g. SAP Document no vs Document no)

the number of blank columns or blank lines varies (some files have no blank lines (good!), many have blank lines plus some metadata, etc.)

there is also at least one “bad” file which looks to be an Excel file saved as CSV

amounts are frequently formatted with “,” thousands separators, making them appear as strings to computers
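To give a flavour of the cleanup involved, here’s a minimal sketch of the kind of normalization pass one might run over each file – the header aliases and the “Amount” column are illustrative assumptions, not the actual code in the repo:

```python
import csv

# Map the various header spellings onto a canonical name (illustrative subset).
HEADER_ALIASES = {
    "SAP Document no": "Document no",
}

def clean_rows(path):
    """Yield normalized dict rows from one of the (messy) GLA CSV files."""
    header = None
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for raw in csv.reader(f):
            cells = [c.strip() for c in raw]
            # Skip fully blank lines.
            if not any(cells):
                continue
            if header is None:
                # Treat the first non-blank row as the header; files with
                # leading metadata lines would need smarter detection.
                header = [HEADER_ALIASES.get(c, c) for c in cells]
                continue
            row = dict(zip(header, cells))
            # Strip "," thousands separators so amounts parse as numbers.
            if "Amount" in row:
                row["Amount"] = row["Amount"].replace(",", "")
            yield row
```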

What is the Data Explorer

Key features at the moment:

Importing data from various sources (the UX of this could be much improved!)

Viewing and visualizing using Recline to create grids, graphs and maps

Cleaning and transforming data using a scripting component that allows you to write and run JavaScript

Saving and sharing: everything you create (scripts, graphs etc) can be saved and then shared via public URL.

Note that persistence (for sharing) is to Gists (here’s the gist for the House Prices demo linked above). This has some nice benefits, such as versioning; offline editing (clone the gist, edit and push); and a bl.ocks.org-style ability to create a gist and have it result in publicly viewable output (though with substantial differences vs blocks …).

What’s Next

There are many areas that could be worked on – a full list of issues is on GitHub. The most important at the moment, I think, are:

As I wrote to the labs list recently, continually adding these new backends and views to core Recline runs the risk of bloat. Instead, we think it’s better to keep the core lean and move these “extensions” out of core with a clear listing and curation process - the design of Recline means that new backends and views can extend the core easily and without any complex dependencies.

This approach is useful in other ways. For example, Recline backends are designed to support standalone use as well as use with Recline core (they have no dependency on any other part of Recline, including core), but this is not very obvious as it stands, with each backend bundled into Recline. To take a concrete example, the Google Docs backend is a useful wrapper for the Google Spreadsheets API in its own right. While this is already true, it isn’t very obvious while the code sits in the main Recline repository; having the repo split out with its own README would make this much clearer.

Thus, if you want to archive Twitter you’ll need to come up with another solution (or pay them, or a reseller, a bunch of money - see Appendix below!). Sadly, most of the online solutions have tended to disappear or be acquired over time (e.g. TwapperKeeper). So a DIY solution would be attractive. After reading various proposals on the web I’ve found the following to work pretty well (but see also this excellent Google-Spreadsheet-based solution).

The proposed process involves 3 steps:

Locate the Twitter Atom Feed for your Search

Use Google Reader as your Archiver

Get your data out of Google Reader (1000 items at a time!)

One current drawback of this solution is that each stage has to be done by hand. It could be possible to automate more of this, and especially the important third step, if I could work out how to do more with the Google Reader API. Contributions or suggestions here would be very welcome!

Note that the above method will become obsolete as of March 5 2013, when Twitter closes down RSS and Atom feeds - continuing their long march to becoming a more closed and controlled ecosystem.

As you struggle, like me, to get precious archival information out of Twitter it may be worth reflecting on just how much information you’ve given to Twitter that you are now unable to retrieve (at least without paying) …

Locate the Twitter Atom Feed for your Search

Twitter provides an Atom feed for any search, at a URL of the form http://search.twitter.com/search.atom?q=%23okfn (here for the #okfn hashtag). Unfortunately these Atom queries are limited to only a few items (around 20), so we’ll need to continuously archive that feed to get full coverage.

Archiving in Google Reader

Just add the feed URL from the previous step to your Google Reader account. It will then start archiving.

Aside: because the Twitter Atom feed is limited to a small number of items and the check in Google Reader only happens every 3 hours (1 hour if someone else is archiving the same feed), you can miss a lot of tweets. One option could be to use Topsy’s RSS feeds, e.g. http://otter.topsy.com/searchdate.rss?q=%23okfn (though it’s not clear how to get more items from this feed either!)

Getting Data out of Google Reader

Google Reader offers a decent (though still beta) API. Unofficial docs for it can be found at http://undoc.in/
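As a rough illustration, here’s a Python sketch of paging data out via that API – the stream/contents endpoint and its n/c paging parameters are as described in the unofficial docs, and authentication (obtaining the token) is omitted, so treat the details as assumptions:

```python
import json
import urllib.parse
import urllib.request

# The feed we archived, and the unofficial stream/contents endpoint
# described at http://undoc.in/ - treat the exact URL as an assumption.
FEED = "http://search.twitter.com/search.atom?q=%23okfn"
BASE = ("http://www.google.com/reader/api/0/stream/contents/feed/"
        + urllib.parse.quote(FEED, safe=""))

def fetch_all(auth_token):
    """Page through the archived feed 1000 items at a time, following
    the continuation token the API returns for the next page."""
    items, continuation = [], None
    while True:
        url = BASE + "?n=1000" + ("&c=" + continuation if continuation else "")
        req = urllib.request.Request(
            url, headers={"Authorization": "GoogleLogin auth=" + auth_token})
        with urllib.request.urlopen(req) as resp:
            data = json.loads(resp.read())
        items.extend(data.get("items", []))
        continuation = data.get("continuation")
        if not continuation:
            return items

# e.g. json.dump(fetch_all(my_token), open("okfn-tweets.json", "w"))
```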

And that’s it! You should now have a local archive of all your tweets!

Appendix

Increasingly, Twitter is selling access to the full Twitter archive, and there are a variety of third-party services (such as Gnip, DataSift, Topsy and possibly more) offering full or partial access for a fee.

WikipediaJS is a simple JS library for accessing information in Wikipedia articles such as dates, places, abstracts etc.

The library is the work of Labs member Rufus Pollock. In essence, it is a small wrapper around the data and APIs of the DBPedia project, and it is the DBPedia folks who have done all the heavy lifting of extracting structured data from Wikipedia - huge credit and thanks to them!
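To show what “wrapping DBPedia” means in practice, here’s a rough Python equivalent of the lookup the library performs in the browser – the endpoint and property URIs reflect DBPedia’s JSON output as I understand it, so treat the specifics as illustrative:

```python
import json
import urllib.request

# Fetch the structured record DBPedia has extracted for a Wikipedia page.
title = "Albert_Einstein"
url = "http://dbpedia.org/data/{}.json".format(title)
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read())

# The response is keyed by resource URI; pull out the English abstract.
resource = data["http://dbpedia.org/resource/" + title]
for entry in resource.get("http://dbpedia.org/ontology/abstract", []):
    if entry.get("lang") == "en":
        print(entry["value"])  # the article's English abstract
```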

Demo and Examples

A demo is included and you can see some examples of the library in action at the following links:

This uses the Recline timeline component (which itself is a relatively thin wrapper around the excellent Verite timeline) plus the Recline Google Docs backend to provide an easy way for people to make timelines backed by a Google Docs spreadsheet.

