Thursday, 2 April 2009

Context — stuff available before the Guardian Open Platform

http://guardian.co.uk is built and runs on a completely custom CMS called R2. This was built recently as an 18-month agile project supported by ThoughtWorks. It’s written in Java, using Spring to hold things together.

Everything on the Guardian site is based on tags (you can see the tags as headings in the right-hand column). These tags are a controlled vocabulary, created and maintained under editorial control.

Every page that is not an article is almost completely controlled by tags. You can visit a tag home page (like http://guardian.co.uk/technology) and by default it will show a bunch of articles appropriate for that tag. Alternatively, editors can take control of a tag page to highlight articles etc.

The system also supports “tag combiner pages” — pages showing articles from multiple tags, like http://guardian.co.uk/environment/climate-change. These are generated in exactly the same way as the single tag pages but will have fewer articles due to the extra filtering.
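To make the "extra filtering" concrete, here is a minimal sketch of how a combiner page could pick its articles. I have no idea how R2 actually does this internally; the article dictionaries and the combine_tags helper are made up for illustration.

```python
def combine_tags(articles, required_tags):
    """Return only the articles that carry every tag in required_tags."""
    required = set(required_tags)
    return [a for a in articles if required <= set(a["tags"])]


# Invented sample data, not real Guardian articles.
articles = [
    {"title": "Carbon targets", "tags": ["environment", "climate-change"]},
    {"title": "Recycling rates", "tags": ["environment"]},
    {"title": "Phone review", "tags": ["technology"]},
]

# A single tag page and a combiner page use the same function;
# the combiner just asks for more tags and so matches fewer articles.
single = combine_tags(articles, ["environment"])
combined = combine_tags(articles, ["environment", "climate-change"])

print(len(single), len(combined))
```

The single tag page is just the special case with one required tag, which fits the description of both kinds of page being generated the same way.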

In addition, any tag page can generate its own RSS feed — just add /rss to the end of the URL. RSS feeds are full fat, sometimes with ads, but their Ts & Cs are currently for personal use only.
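As a rough sketch of using those feeds from a script: the only thing taken from the site here is the "/rss" suffix. The sample feed below is invented so the example runs without a network connection, and the real feeds will carry more fields than this.

```python
import xml.etree.ElementTree as ET


def rss_url(tag_page_url):
    """Build the feed URL for a tag page by appending /rss."""
    return tag_page_url.rstrip("/") + "/rss"


def item_titles(rss_text):
    """Pull the item titles out of an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title") for item in root.iter("item")]


# Made-up sample feed, standing in for whatever the real feed returns.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Technology</title>
<item><title>First story</title><link>http://example.com/1</link></item>
<item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>"""

print(rss_url("http://guardian.co.uk/technology"))
print(item_titles(SAMPLE_FEED))
```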

What is the Open Platform?

The new Open Platform allows for commercial use of the Guardian content. The Guardian have made available everything to which they have full rights (so no third party feeds and no content with restricted rights from freelancers). This includes full articles, audio and video (the audio and video are not served directly through the API, but the API provides enough metadata to link to them). There are currently around 700,000 articles stretching back at least a decade and sometimes longer.

The commercial terms seem reasonable and practical: you can build an interface that links to content on the Guardian web site for free, but if you want to publish the content on your own site then you’ll have to display Guardian adverts alongside it. This works for the Guardian too: they already get orders of magnitude more readers on their web site than their print circulation, and anything that increases their online advertising placement is therefore a good thing.

They have also built an API explorer (with a UI modelled on Firebug) that allows you to play with the data and discover the requests you’d like to make. All the data is URL based, with each article having a web URL (its location on the Guardian web site), a data URL (its location in the Open Platform API) and multiple tag URLs. The other entry point is the search URL, where you can search by text, tags and various dates.

Data is returned by default in a structured XML document, but you can switch this to JSON using the explorer UI. There’s also an Atom feed available by appending &format=atom to the URL. This data is the least rich at the moment, but easier to feed into other services like Yahoo Pipes.
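If you are building these URLs in code rather than by hand, something like the sketch below avoids mangling the query string. Only the format=atom parameter comes from what I saw; the api.example.com host and the q parameter are placeholders, and I haven't checked which other format values the API accepts.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_format(url, fmt):
    """Return url with its format query parameter set to fmt."""
    parts = urlsplit(url)
    # Drop any existing format parameter so it is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "format"
    ]
    query.append(("format", fmt))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder search URL, not the real Open Platform endpoint.
search_url = "http://api.example.com/search?q=climate"
print(with_format(search_url, "atom"))
```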

The explorer makes it easy to browse the content — following links between articles and filtering any searches by picking out tags. Each search returns not just articles with full metadata, but also a section detailing the tags of all the articles in the search. In the tag data you get the number of articles in the search with each tag — so it’s easy to see how further filtering will affect your search results.
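The per-tag counts are what people usually call facets. Here is a small sketch of the idea, assuming each result is a dictionary with a list of tags; that shape is my own invention for the example and not the actual response format.

```python
from collections import Counter


def tag_counts(results):
    """Count how many articles in a result set carry each tag."""
    counts = Counter()
    for article in results:
        # set() guards against a tag being listed twice on one article.
        counts.update(set(article["tags"]))
    return counts


# Invented search results for illustration only.
results = [
    {"title": "A", "tags": ["environment", "climate-change"]},
    {"title": "B", "tags": ["environment", "energy"]},
    {"title": "C", "tags": ["environment", "climate-change", "politics"]},
]

counts = tag_counts(results)
for tag, n in counts.most_common():
    print(tag, n)
```

Each count is also the number of results you would be left with if you added that tag as a filter, which is why having it in the response is so handy.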

At the moment, the results only list topic tags, but there are also contributor tags and other kinds, all editorially chosen and selected as interesting topics. For example, not every MP has their own tag, just those that show up in the news. The other kinds of tags will be added to the API shortly. These tags strike me as one of the most useful areas of the API: essentially the Guardian have built up an ontology of topics and they’re making it available for free. As far as I understand, if you use the tags but not the content (or do no more than link to content on the Guardian website) then you don’t even need to put Guardian adverts on your site (this may be totally wrong; I haven’t read the Ts & Cs in full yet).

The Open Platform site has some examples on it already — the one that Simon Willison highlighted was a Stamen project: http://guardian.apimaps.org allows you to annotate Guardian articles with geo data (which isn’t built in by the Guardian yet) — then view the results on a map.

The Data Store

As well as the content of the articles themselves, the Guardian have opened up some of the data that they use to prepare them. They have a team dedicated to collecting and updating various statistics from bird populations to government budgets and everything in between. The team is putting this data online and available as Google Spreadsheets — currently there are over 100 (up from 86 at launch).

Some of this data is the result of weeks of effort from the Guardian team. The sheet of government data by department was so impressive that the Cabinet Office called and said they’d like a copy as they couldn’t get the data themselves! All the data is vetted and has been published in the newspaper, so you can be reasonably confident of its accuracy.

New data is regularly published on the Data Store blog and the Guardian is trying to build up a community of data wonks around it. Already some people are getting involved, taking the data and doing new things with it. See various examples at http://ouseful.info, such as pulling the MP expenses spreadsheet into IBM’s Many Eyes visualisation system. Previously published data is also being regularly updated by the Guardian team: Simon Rogers has pledged to keep updating several of the data sets he has created.