CKAN

By Health & Social Care Information Centre | 24 April 2013

Introduction

A lot of very useful NHS data is published by data.gov.uk. To find it, one uses their installation of CKAN and, in particular, its REST API. So far, so good. However, the CKAN API is just a little bit too difficult to use from the command line or from Java.

So, as with other things, we wrote some open source clients to make it a lot easier. There are six clients:

hdn-ckan-list

hdn-ckan-details

hdn-ckan-relationships

hdn-ckan-query

hdn-ckan-dataset-search

hdn-ckan-resource-search

The split is to cater for differences in using and manipulating the CKAN API.

Each client can be used either as a standalone program (in which case it produces results in TSV format for ease of use with standard POSIX tools) or as a Java library (a jar file). To make things as easy as possible, we’ve also written some wrappers for POSIX and Debian/Ubuntu. Your choices, in order of decreasing convenience, are:

A tar ball, which contains a complete file system to untar over your root /. This should work on any POSIX system, including Mac OS X and Cygwin.

Six standalone Java jars, with all dependencies included, suitable for execution or as a library

A set of Java code libraries with source

Forking from github

The best way to get going is to use the command line. We’ll look later at how to create requests programmatically using the Java library.

Using the Clients on the Command Line

The way you do this varies depending on what you used above:

If you’ve installed the deb package or the tar ball, you’ll have the programs on your PATH. To use one of them, open a terminal console and type its name, eg hdn-ckan-list. They take standard POSIX options. The programs are:

hdn-ckan-list

hdn-ckan-details

hdn-ckan-relationships

hdn-ckan-query

hdn-ckan-dataset-search, and

hdn-ckan-resource-search

If you’ve installed the standalone jar files, you’ll need to run commands from the folder you downloaded the files to. Open a terminal console and change to that folder. For example, for hdn-ckan-list, type java -jar hdn-ckan-list.jar. Each jar takes the same standard POSIX options as the corresponding program above. For the rest of this document, wherever you see hdn-ckan-XXXX … you can substitute java -jar hdn-ckan-XXXX.jar …
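If you use the standalone jars a lot, a small shell function saves some typing. The following is only a sketch and is not something shipped with the clients: the function name hdn_ckan and the variables HDN_CKAN_JAR_DIR and JAVA are made up for illustration, and it assumes the jars keep the hdn-ckan-XXXX.jar names described above.

```shell
# Hypothetical helper: run a standalone client jar by its short name.
# HDN_CKAN_JAR_DIR (assumed name) is the folder the jars were downloaded to;
# JAVA (assumed name) lets you point at a specific java binary.
hdn_ckan() {
    tool=$1
    shift
    "${JAVA:-java}" -jar "${HDN_CKAN_JAR_DIR:-.}/hdn-ckan-$tool.jar" "$@"
}
```

With this defined, hdn_ckan list --help runs java -jar ./hdn-ckan-list.jar --help from the current folder.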

If you’ve downloaded or forked source from github, you can use IntelliJ to run the main classes. Open source/subprojects.ipr and run the main classes in the group ckan, sub-group client. There are already some sample configurations set up for you to debug in IntelliJ. If you don’t have or use IntelliJ (and you really should) then you can open the source in Eclipse or NetBeans. You’ll need to add the libraries ‘annotations’ (library/annotations/VERSION/annotations.jar) and ‘jopt-simple’ (library/jopt-simple/VERSION/jopt-simple-VERSION.jar) to the class path.

Using hdn-ckan-list

Checking everything’s OK

Before we get going, let’s check that everything works as expected. Run the command hdn-ckan-list --help (remember to substitute java -jar hdn-ckan-list.jar if you need to). You should see a list of supported options. At the time of writing, it looks like this:

If the output seems a bit compressed, it’s because we’re formatting for a 40 character wide screen – useful if you’re running this over ssh on Android. Whilst you can’t see it above, help output always produces an exit code of 2.
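In a script, the exit code is an easy way to tell help output apart from a normal run. A minimal sketch, assuming only the two exit codes this document mentions (0 for success, 2 for help); the function name describe_exit is made up for illustration:

```shell
# Run a command quietly and name its outcome using the exit codes
# mentioned in this document: 0 for success, 2 for help output.
# Any other code is reported as-is, since its meaning is not documented here.
describe_exit() {
    "$@" >/dev/null 2>&1
    status=$?
    case $status in
        0) echo "success" ;;
        2) echo "help shown" ;;
        *) echo "exit code $status" ;;
    esac
}
```

If the clients behave as described, describe_exit hdn-ckan-list --help should print help shown.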

Since the options are regular POSIX long options (and are named similarly to those in the GNU coding standards), we can abbreviate them. Hence hdn-ckan-list -h and hdn-ckan-list --he will produce the same output. The only time you can’t do this is if the abbreviation would be ambiguous.

Standard GNU-like stuff. It’s worth understanding the version number, in this case, 2013.03.01.1537-development. The part before the hyphen is the timestamp of the last git check in used to build the binary – you should be able to find it using git log. Additionally, this should match the version of the deb package. The part after is the git branch the code was built from. Usually this will be either development or master.
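The two halves of the version string can be pulled apart with plain shell parameter expansion. This sketch uses the example version quoted above; the variable names are illustrative only.

```shell
# Split a version string of the form TIMESTAMP-BRANCH.
version="2013.03.01.1537-development"
timestamp=${version%%-*}   # everything before the first hyphen
branch=${version#*-}       # everything after the first hyphen
echo "built from branch '$branch', last check-in at $timestamp"
```

Splitting on the first hyphen means a branch name that itself contains hyphens is still kept whole.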

If instead it says unknown version then it means you’re using code that you’ve compiled yourself or that wasn’t released ‘officially’.

Getting a List of all the Known Identifiers on data.gov.uk

Apart from the options --help and --version, you need to specify one, and one only, of the options above. Each option returns a list of identifiers, with a header row.

This really couldn’t be easier:

hdn-ckan-list --dataset-names

Retrieves a list of dataset names, and displays it on standard out (stdout):

The exit code is 0. Effectively, the results are a single column of tab separated value (TSV) data, with a header row. The data in data.gov.uk isn’t particularly clean, and occasionally you’ll find data with embedded CR (Carriage Return), LF (Line Feed) and HT (Horizontal Tab) control codes. When these are encountered, the data returned uses the Unicode replacement character U+FFFD, as these control codes are invalid in TSV. You’ll know when this happens, as you’ll see a black diamond with a question mark in it on most modern terminals.

By the way, the sort order of the data isn’t known, and the order of the results is unlikely to be stable. If you don’t want the header row you can remove it using tail on POSIX systems:
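A minimal sketch of both ideas, using only standard POSIX tools. The function names are made up for illustration; the second one sorts the rows so that two runs can be compared even though the server's ordering is not stable.

```shell
# Drop the header row: print from line 2 onwards.
strip_header() {
    tail -n +2
}

# Drop the header row, then sort bytewise so repeated runs can be compared.
sorted_rows() {
    strip_header | LC_ALL=C sort
}
```

For example, hdn-ckan-list --dataset-names | strip_header prints just the names, and piping through sorted_rows instead gives a repeatable order.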

Using hdn-ckan-details

Checking everything’s OK

Before we get going, let’s check that everything works as expected. Run the command hdn-ckan-details --help (remember to substitute java -jar hdn-ckan-details.jar if you need to). You should see a list of supported options. At the time of writing, it looks like this:

If the output seems a bit compressed, it’s because we’re formatting for a 40 character wide screen – useful if you’re running this over ssh on Android. Whilst you can’t see it above, help output always produces an exit code of 2.

Since the options are regular POSIX long options (and are named similarly to those in the GNU coding standards), we can abbreviate them. Hence hdn-ckan-details -h and hdn-ckan-details --he will produce the same output. The only time you can’t do this is if the abbreviation would be ambiguous.

Standard GNU-like stuff. It’s worth understanding the version number, in this case, 2013.03.01.1537-development. The part before the hyphen is the timestamp of the last git check in used to build the binary – you should be able to find it using git log. Additionally, this should match the version of the deb package. The part after is the git branch the code was built from. Usually this will be either development or master.

If instead it says unknown version then it means you’re using code that you’ve compiled yourself or that wasn’t released ‘officially’.

Getting a Dataset by name

Apart from the options --help and --version, you need to specify one, and one only, of the options above. Each option returns a single row of data, with a header row, as tab separated values (TSV).

The exit code is 0. The data in data.gov.uk isn’t particularly clean, and occasionally you’ll find data with embedded CR (Carriage Return), LF (Line Feed) and HT (Horizontal Tab) control codes. When these are encountered, the data returned uses the Unicode replacement character U+FFFD, as these control codes are invalid in TSV. You’ll know when this happens, as you’ll see a black diamond with a question mark in it on most modern terminals. This is the case above.
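If you would rather not rely on spotting the diamond by eye, grep can find the affected rows. This is a sketch that assumes the output is encoded as UTF-8, which this document does not state; the names replacement_char and rows_with_replacements are made up for illustration.

```shell
# U+FFFD encoded as UTF-8 is the three bytes EF BF BD (octal 357 277 275).
# Assumption: the client writes its output as UTF-8.
replacement_char=$(printf '\357\277\275')

# Print, with line numbers, the rows on stdin that contain U+FFFD.
rows_with_replacements() {
    grep -n "$replacement_char"
}
```

Pipe the output of any of the clients into rows_with_replacements; no output means no control codes were replaced.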

By the way, the sort order of the data isn’t known, and the order of the results is unlikely to be stable.

If you don’t want the header row you can remove it using tail on POSIX systems:

Unfortunately, the groups are formatted as a long string – you can split this every 36 characters using cut.
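A sketch of the splitting, presumably needed because each group is a 36-character UUID (32 hexadecimal digits plus four hyphens). The first function uses fold, a different POSIX tool from cut, because it splits the whole string in one go; the second uses cut to pick out one piece. Both function names are made up for illustration.

```shell
# Split one long string into 36-character pieces, one per line.
split_groups() {
    printf '%s\n' "$1" | fold -w 36
}

# With cut, pick out the n-th 36-character piece (n starts at 1).
nth_group() {
    start=$(( ($2 - 1) * 36 + 1 ))
    end=$(( $2 * 36 ))
    printf '%s\n' "$1" | cut -c "$start-$end"
}
```

Pass the groups value as the first argument, for example split_groups "$groups".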

Using hdn-ckan-relationships

This program finds the different kinds of relationships a dataset has to other datasets.

Checking everything’s OK

Before we get going, let’s check that everything works as expected. Run the command hdn-ckan-relationships --help (remember to substitute java -jar hdn-ckan-relationships.jar if you need to). You should see a list of supported options. At the time of writing, it looks like this:

If the output seems a bit compressed, it’s because we’re formatting for a 40 character wide screen – useful if you’re running this over ssh on Android. Whilst you can’t see it above, help output always produces an exit code of 2.

Since the options are regular POSIX long options (and are named similarly to those in the GNU coding standards), we can abbreviate them. Hence hdn-ckan-relationships -h and hdn-ckan-relationships --he will produce the same output. The only time you can’t do this is if the abbreviation would be ambiguous.

Standard GNU-like stuff. It’s worth understanding the version number, in this case, 2013.03.01.1537-development. The part before the hyphen is the timestamp of the last git check in used to build the binary – you should be able to find it using git log. Additionally, this should match the version of the deb package. The part after is the git branch the code was built from. Usually this will be either development or master.

If instead it says unknown version then it means you’re using code that you’ve compiled yourself or that wasn’t released ‘officially’.

Finding the datasets a dataset is a child-of

Apart from the options --help and --version, you need to specify one, and one only, of the options above. Each option returns a list of data, with a header row, as tab separated values (TSV).

Note that the order does not have to match that for dataset names. --as-dataset-ids can also take an argument of either true or false: true returns ids; false returns names. This is useful for scripting:
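One way to script against this is to run a command once per returned row. The following is a sketch, not part of the clients: the function name for_each_row is made up, and the command it runs for each row is whatever you supply.

```shell
# Read TSV from stdin, skip the header row, and run the given command
# once per remaining line, passing the line as the last argument.
# The command's stdin is redirected so it cannot swallow the remaining rows.
for_each_row() {
    tail -n +2 | while IFS= read -r row; do
        "$@" "$row" </dev/null
    done
}
```

Pipe the output of hdn-ckan-relationships, run with --as-dataset-ids true, into for_each_row followed by the command to run for each id, for example for_each_row echo "related dataset:".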

Finding other kinds of relationship

To find a different kind of relationship, replace --child-of with one of:

--dependency-on

--depends-on

--derives-from

--has-derivation

--linked-from

--links-to

--parent-of

Using hdn-ckan-query

Checking everything’s OK

Before we get going, let’s check that everything works as expected. Run the command hdn-ckan-query --help (remember to substitute java -jar hdn-ckan-query.jar if you need to). You should see a list of supported options. At the time of writing, it looks like this:

If the output seems a bit compressed, it’s because we’re formatting for a 40 character wide screen – useful if you’re running this over ssh on Android. Whilst you can’t see it above, help output always produces an exit code of 2.

Since the options are regular POSIX long options (and are named similarly to those in the GNU coding standards), we can abbreviate them. Hence hdn-ckan-query -h and hdn-ckan-query --he will produce the same output. The only time you can’t do this is if the abbreviation would be ambiguous.

Standard GNU-like stuff. It’s worth understanding the version number, in this case, 2013.03.01.1537-development. The part before the hyphen is the timestamp of the last git check in used to build the binary – you should be able to find it using git log. Additionally, this should match the version of the deb package. The part after is the git branch the code was built from. Usually this will be either development or master.

If instead it says unknown version then it means you’re using code that you’ve compiled yourself or that wasn’t released ‘officially’.

Getting dataset ids from a tag

This really couldn’t be easier:

hdn-ckan-query --dataset-ids-by-tag Jurassic

Retrieves a list of dataset ids, and displays it on standard out (stdout):

Using hdn-ckan-dataset-search

Checking everything’s OK

Before we get going, let’s check that everything works as expected. Run the command hdn-ckan-dataset-search --help (remember to substitute java -jar hdn-ckan-dataset-search.jar if you need to). You should see a list of supported options. At the time of writing, it looks like this:

Option                                  Description
------                                  -----------
--any <substring to search for case     search in any field
  insensitively>
--as-dataset-ids [Boolean: true if as   returns results as Dataset Ids (UUIDs)
  dataset ids; false or unspecified to
  produce dataset names]
--author <substring to search for case  search in the author field
  insensitively>
--groups <substring to search for case  search in the groups field (name or
  insensitively>                          UUID)
--help                                  Displays help for options
--maintainer <substring to search for   search in the maintainer field
  case insensitively>
--notes <substring to search for case   search in the notes field
  insensitively>
--tags <substring to search for case    search in the tags field (a hyphenated
  insensitively>                          tag)
--title <substring to search for case   search in the title field
  insensitively>
--update-frequency <substring to        search by update frequency (uncertain
  search for case insensitively>          what this is)
--version                               Displays version

If the output seems a bit compressed, it’s because we’re formatting for a 40 character wide screen – useful if you’re running this over ssh on Android. Whilst you can’t see it above, help output always produces an exit code of 2.

Since the options are regular POSIX long options (and are named similarly to those in the GNU coding standards), we can abbreviate them. Hence hdn-ckan-dataset-search -h and hdn-ckan-dataset-search --he will produce the same output. The only time you can’t do this is if the abbreviation would be ambiguous.

Standard GNU-like stuff. It’s worth understanding the version number, in this case, 2013.03.01.1537-development. The part before the hyphen is the timestamp of the last git check in used to build the binary – you should be able to find it using git log. Additionally, this should match the version of the deb package. The part after is the git branch the code was built from. Usually this will be either development or master.

If instead it says unknown version then it means you’re using code that you’ve compiled yourself or that wasn’t released ‘officially’.

Using hdn-ckan-resource-search

Checking everything’s OK

Before we get going, let’s check that everything works as expected. Run the command hdn-ckan-resource-search --help (remember to substitute java -jar hdn-ckan-resource-search.jar if you need to). You should see a list of supported options. At the time of writing, it looks like this:

Option                                  Description
------                                  -----------
--description <substring to search for  search in the description field
  case insensitively>
--format <substring to search for case  search in the format field (stick to
  insensitively>                          file extensions)
--help                                  Displays help for options
--url <substring to search for case     search in the url field
  insensitively>
--version                               Displays version

If the output seems a bit compressed, it’s because we’re formatting for a 40 character wide screen – useful if you’re running this over ssh on Android. Whilst you can’t see it above, help output always produces an exit code of 2.

Since the options are regular POSIX long options (and are named similarly to those in the GNU coding standards), we can abbreviate them. Hence hdn-ckan-resource-search -h and hdn-ckan-resource-search --he will produce the same output. The only time you can’t do this is if the abbreviation would be ambiguous.

Standard GNU-like stuff. It’s worth understanding the version number, in this case, 2013.03.01.1537-development. The part before the hyphen is the timestamp of the last git check in used to build the binary – you should be able to find it using git log. Additionally, this should match the version of the deb package. The part after is the git branch the code was built from. Usually this will be either development or master.

If instead it says unknown version then it means you’re using code that you’ve compiled yourself or that wasn’t released ‘officially’.

Finding resources

Just like hdn-ckan-dataset-search, you can combine options to generate AND queries. One difference is that the strings specified are supposed to be used case insensitively, and that substrings are supposed to match.
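As a sketch of combining options, the function below asks for resources whose format contains csv and whose URL contains nhs. The --format and --url options come from the help output above; the search values and the function name are only examples.

```shell
# Both criteria must match (an AND query): the format contains "csv"
# and the URL contains "nhs". The search values are only examples.
find_nhs_csv_resources() {
    hdn-ckan-resource-search --format csv --url nhs
}
```

Wrapping the call in a function is just a convenience; the command inside can equally be typed directly at the prompt.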

Using the Java library programmatically

If you’ve downloaded or forked source from github, you can use IntelliJ. Open source/subprojects.ipr and start hacking.

If you’ve downloaded the jars (and source zips), create a project or add them to an existing project in your favourite IDE (if it isn’t IntelliJ, then switch now).

You’ll need the jar files:

ckan-api

ckan-domain

ckan-schema

And their dependencies, which, at the time of writing, are:

common

common-http

common-http-client

common-http-client-json

common-reflection

common-exceptions

common-tuples

common-naming

common-parsers

common-parsers-json

common-serialisers

common-serialisers-separatedValues

and the library, annotations.jar.

This list may change. To find the most up-to-date list, either extract META-INF/MANIFEST.MF from ckan-api.jar and read the Class-Path entry, or open the IntelliJ project (source/subprojects.ipr) and look at the dependencies of the module ckan-api (sensibly, module names match jar names and source zip names).
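Reading the Class-Path entry by hand is fiddly because long manifest lines are wrapped at 72 bytes, with each continuation line starting with a single space. The following sketch rejoins the pieces and prints one entry per line; the function name class_path_entries is made up for illustration.

```shell
# Read a MANIFEST.MF on stdin and print the Class-Path entries, one per line.
# Manifest lines longer than 72 bytes are wrapped; continuation lines start
# with a single space, which is dropped when the pieces are rejoined.
class_path_entries() {
    tr -d '\r' | awk '
        /^Class-Path: / { collecting = 1; value = substr($0, 13); next }
        collecting && /^ / { value = value substr($0, 2); next }
        collecting       { collecting = 0 }
        END { n = split(value, parts, " "); for (i = 1; i <= n; i++) print parts[i] }
    '
}
```

Assuming unzip is installed, unzip -p ckan-api.jar META-INF/MANIFEST.MF | class_path_entries lists the jars that ckan-api depends on.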

Making simple requests

The ‘guts’ of the Java library’s API is the interface CkanApi, in the package uk.nhs.hdn.ckan.api. This interface provides methods for all the read-only operations one might want to do against CKAN. The methods take some parameters and return an immutable ApiMethod. This works a bit like a Java Method (think of it as a late-bound method) but is strongly typed. Effectively, it’s the same as a .NET delegate. Calls on this object are thread-safe. The design idea here is that configuring everything for a REST call is quite a bit of work, but one might want to make the call many times.

A concrete instance of the CkanApi interface is the class ConcreteCkanApi. The easiest way to use this is with the constant DataGovUk which provides a properly configured instance for data.gov.uk. For example:

The first line uses a static reference to DataGovUk, which is an instance of ConcreteCkanApi, to create everything that’s needed to get all dataset names.

The second line makes the REST call and returns all the dataset names. If it goes wrong, one of three exceptions will be thrown:

CouldNotConnectHttpException – the service is unavailable due to network problems or remote server problems

UnacceptableResponseException – the response code was not 200 OK, there was no content or content was not JSON

CorruptResponseException – the response was not valid JSON that could be parsed and understood

Of course, a request might be one off, so you could chain it together:

final DatasetId[] execute = DataGovUk.allDatasetIds().execute();

And so pretend it’s like a regular Java method. (It’s worth noting that if this were Python, we’d make execute() the default method for the object, so one could just do DataGovUk.allDatasetIds()(), but Java lacks such syntactic sugar.)

Search Requests

Search requests are quite complex. They follow the same pattern as other API methods, but take three parameters:

searchCriteria – zero or more clauses to match. Multiple clauses to match are treated as an AND.

offset – the CKAN API returns paged results. This specifies the first row to be returned.

limit – the maximum number of rows to return (values over 1000 are clamped to 1000).

Interestingly, the offset and limit are implemented internally as search criteria by CKAN. For convenience, there are two constants in UnsignedLongSearchCriterion to deal with useful values for offset and limit:

MinimumOffset

MaximumLimit

This design by CKAN means that one may have to issue the same search criteria more than once, interpreting the previous search response to see if one should increase the offset (and optionally reduce the limit). Such a design works well for a web page, but less well for a regular client. It is also not clear if the search results between requests are consistent (eg ‘pages’ of results don’t get introduced) and stably sorted.

The search criteria use a Humane Interface Design. One creates an instance of a SearchCriterion and ANDs clauses on to it. For convenience, common search criterion creating methods are in StringSearchCriterion: