2014: The Year of CSV

Or, more accurately, my goal for 2014 is to make it easier for developers to use tabular data. Depending on your point of view, highlighting CSV as a format may seem a bit crazy. After all, it’s a basic data format that’s widely used and deployed so it’s hard to imagine that any work needs to be done with it. But it is also the cause of a lot of pain because of inconsistencies in how it is created: CSVs generated from standard spreadsheets and databases as a matter of course use variable encodings, variable quoting of special characters, and variable line endings.

The vast majority of published open data is tabular; over 90% of the data published on data.gov.uk is tabular data. When data is in a consistent tabular form, you can do all sorts of things with it. Web applications such as CKAN, Socrata, Google Fusion Tables, Junar, enigma and many others provide mechanisms to view the tabular data that they store in flexible ways: as tables, as graphs, as heatmaps. But they rely on the fact that they are storing the data themselves: they know it is in a consistent format, they know the types of values in different columns, they know what the codes within the data means.

For tabular data to be part of the web, we need to separate where data is stored from how it can be viewed and processed. We should be able to open a file of tabular data in our browsers, and see it displayed as a table. We should be able to use a <heatmap src="population.csv"> element in our web pages and view it as an interactive map. We should be able to search for data on the web using a search engine that looks inside the data itself to work out what it’s about.

More, I believe that tabular data needs to live not just on the web but in the web. We should be able to link from documents to cells within tabular data, from a cell in one table to a cell in another, and from tabular data to all the other rich information that exists on the web.

At the moment, tabular data on the web is published in one of two ways:

First, tabular data is published as Excel spreadsheets (or spreadsheets using another tool, but it’s most usually Excel). These provide a huge amount of flexibility for providing metadata about the dataset such as where it comes from, further details about what columns actually mean, and notes about the values in particular cells. Take a look at this spreadsheet from the Office of National Statistics in the UK, and you’ll see these features being used to the fullest extent. For statisticians — indeed anyone who cares about the responsible publication of data — the ability to provide this context, aiming to add explanation and avoid misinterpretation, is very important.

For developers, using Excel for data is problematic because the data is encoded within it in an unreliable and unpredictable way. It’s not desperately difficult, with a bit of work, to put something together that pulls data out of a single Excel file. But if that tool has to also work with similar Excel files, perhaps published in a series, each of which will have their formatting quirks, it becomes terribly time-consuming. Creating generic tools that can extract data from any Excel file is completely impractical.

Second, tabular data is published as CSV files (or possibly tab-separated or colon-separated files). These are usually relatively easy for developers to process, at least so long as a consistent encoding is used and correct quoting of values and escaping of characters has been carried out. But CSV files are desperately poor at providing contextual information that can aid in automated processing of the data that they hold, or even in informing developers about how they should be interpreted. Publishers who want to provide that context either package CSVs up with human-readable documentation (in zips arranged in non-standard ways), or assume that the CSV has been retrieved via a particular explanatory web page.

CSV is insufficient for tabular data on the web, with independent generic viewers, because it lacks this context. Not knowing which columns contain numeric data makes it impossible to sort correctly. Not knowing which values are links makes it impossible to connect the data into the web. Crucially, from an open data perspective, this lack of expressivity means there is no obvious way to express the licence under which CSV data can be reused.

There is a sweet spot between developer-friendly, unexpressive CSV and developer-hostile, expressive Excel.

If we’re successful, I expect to see implementation of parsers/importers for R, NoSQL databases such as MongoDB, SQL databases, and triplestores. I expect to see exporters/converters from those same databases and from spreadsheet software. I expect to see validators, visualisers and other generic tooling. In fact, we’re already working on a CSV validator at ODI.

And I’m not the only one who thinks CSV’s time has come. Max Ogden is organising CSVConf in July during OKFestival in Berlin. Mark the dates in your diary. (You should also check out Max’s work on synchronising data within dat.)