communicating with data

government

The United States is the only western country without a centralized data office. Instead, official statistics are produced by well over 100 agencies. This makes obtaining official US data difficult, which is something of a paradox because in most cases these data are public and free. Of course, with data coming from so many sources, they also come in a variety of shapes and sizes. Says Wired,

Until now, the US government’s default position has been: If you can’t keep data secret, at least hide it on one of 24,000 federal Web sites, preferably in an incompatible or obsolete format.

The Obama administration committed to tackling this and making data more widely available. To that end, a data portal was announced in early April and data.gov was officially launched at the end of May.

Data.gov is three things in one.

A sign that this administration wants to make the data more accessible, especially to developers.

A shift towards open formats, such as XML.

A catalogue of datasets published by US government agencies.

The rationale is that with data.gov, data are available to wider audiences. There’s a fallacy in that, because the layperson cannot do much with an ESRI file. But hopefully someone else can, and will build something out of it for the good of the community.

The aspect I found most interesting is the catalogue proper. For each indexed dataset, data.gov builds an abstract, inspired by the Dublin Core Metadata Initiative, with fields such as authoring agency, keywords, units, and the like. This, in itself, is not a technological breakthrough, but imagine if all the datasets produced by all the agencies were described in such a uniform fashion. Then, retrieving data would be a breeze.
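To make the point about uniform descriptions concrete, here is a minimal sketch in Python. The record layout, the field names, the agencies and the URLs are all made up for illustration; this is not data.gov's actual schema. The idea is only that once every dataset carries the same fields, one small function can search across agencies.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DatasetRecord:
    """One catalogue entry. Field names are assumptions, not data.gov's schema."""
    title: str
    agency: str                 # authoring agency
    keywords: Tuple[str, ...]   # free-form tags
    units: str                  # unit of the main measure
    url: str                    # where the dataset actually lives


def find_by_keyword(records: List[DatasetRecord], keyword: str) -> List[DatasetRecord]:
    """Return the records tagged with `keyword`, ignoring case."""
    wanted = keyword.lower()
    return [r for r in records if wanted in (k.lower() for k in r.keywords)]


# Hypothetical entries from two different agencies, described the same way.
records = [
    DatasetRecord("Example earthquake feed", "Agency A",
                  ("earthquakes", "geology"), "magnitude",
                  "https://example.org/a.xml"),
    DatasetRecord("Example air quality readings", "Agency B",
                  ("air", "environment"), "ppm",
                  "https://example.org/b.csv"),
]

matches = find_by_keyword(records, "Geology")
print([r.title for r in matches])
```

Because both agencies fill in the same fields, the search does not need to know anything about who published what.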

Note that data.gov does not store the datasets. It provides a storefront which redirects users to the proper location once a dataset has been selected.

There have been other, similar initiatives. Fedstats.gov, allegedly, provided a link to every statistical item produced by the federal government. By their own admission, the home page was last updated in 2007, and its overall design hasn’t changed much since its launch by the Clinton administration in 1997 (a laudable effort at the time). Another initiative, http://usgovxml.com, is a private portal to all data available in XML format.

It may come as a surprise that data.gov doesn’t touch the last 3 steps. Well, it certainly will be a surprise for anyone expecting the government to open a user-centric, one-stop shop for data. Data.gov is certainly not a destination website for lay audiences.

It doesn’t host the data either. However, its existence drives agencies to publish their datasets in compliance with its standards, so we can say that it indirectly addresses access.

So what it really is about is finding data. Currently, the site has two services to direct users to a dataset: a search engine and a catalogue. The browsable catalogue has only one layer of hierarchy, and while this is fine with the initial volume (47 datasets, around 200 as of the end of June), that won’t suffice if the ambition is to index 100,000 federal data feeds.
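The two ways of finding a dataset can be sketched in a few lines of Python. Everything here is hypothetical (the entries, the category names and both helper functions); it only illustrates the difference between browsing one layer of categories and searching.

```python
from collections import defaultdict

# Hypothetical catalogue entries: (title, category, keywords).
ENTRIES = [
    ("Example earthquake feed", "Geography", {"earthquakes", "geology"}),
    ("Example air quality readings", "Environment", {"air", "pollution"}),
    ("Example river gauge levels", "Environment", {"water", "rivers"}),
]


def browse(entries):
    """Group titles under a single layer of categories."""
    shelves = defaultdict(list)
    for title, category, _keywords in entries:
        shelves[category].append(title)
    return dict(shelves)


def search(entries, term):
    """Return titles whose title text or keyword set matches `term`."""
    term = term.lower()
    return [title for title, _category, keywords in entries
            if term in title.lower() or term in keywords]


print(browse(ENTRIES))
print(search(ENTRIES, "air"))
```

With a single layer, every new dataset lands on one of a fixed number of shelves, so each shelf grows with the catalogue. That is manageable for a few hundred entries, but at 100,000 feeds a shelf would hold far too many titles to scan by eye, and search (or a deeper hierarchy) has to carry the load.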

All in all, it could be argued that data.gov doesn’t do much by itself. But what is interesting is what it enables others to do.

In the longer term, it will drive all agencies to publish their data under one publication standard. And if you have 100,000 datasets published under that standard, and if people use it to find them, then we will have a de facto industry standard to describe data. The consequences of that cannot be overestimated.

The other, less obvious long-term advantage is what it will allow developers to create. There are virtually no technical barriers to creating interesting applications on top of these datasets. Chances are that some of these applications could change our daily lives. And they will be invented not by the government, but by individuals, researchers or entrepreneurs. Quite something to look forward to.