Blog IODC 2016
Madrid. October 6-7, 2016

Do We Need to Educate Open Data Users?

Tony Hirst is a Senior Lecturer at The Open University, UK, open data practitioner, and regular blogger at ouseful.info.

Whilst promoting the publication of open data is a key, indeed necessary, ingredient in driving the global open data agenda, promoting initiatives that support the use of open data is perhaps an even more pressing need.

A quick survey of the open data portals hosted by organizations such as the World Bank or the United Nations reveals a staged approach towards promoting data use. First, the portal should make data discoverable at the level the user is likely to want it: a particular indicator for a particular country, or for a set of countries in a particular region or grouping. Second, providing a sortable tabular preview of the data allows for quick comparisons to be made over a small dataset. Some portals also include graphical tools or simple analysis tools for visualizing or otherwise working with the data in situ. Third, the site is likely to provide a tool for exporting the selected data in one or more standard formats – simple CSV text files, or popular spreadsheet formats.

The portal may also make data available via an API – a machine-readable application programming interface that allows computer software to interrogate the data store and retrieve data from it directly. APIs support two main ways of working: building your own website that calls the API directly and republishes the original data, or using data analysis tools that “wrap” the API and allow you to pull the data straight into the analysis tool.
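To make the idea of “wrapping” an API a little more concrete, here is a minimal sketch of what such a wrapper might do behind the scenes: build a request URL, then flatten the response into simple rows. The URL pattern is modelled on the World Bank's public API and the shape of the response is an assumption on my part, so check the portal's own documentation before relying on either. The function names are hypothetical, and the sample response uses invented placeholder values in place of a real network call.

```python
import json
from urllib.parse import quote

# Assumed URL pattern, modelled on the World Bank's public v2 API.
# Check the portal's documentation before relying on it.
BASE_URL = "https://api.worldbank.org/v2"


def build_url(country, indicator):
    """Build a request URL for one indicator and one country code."""
    return (
        f"{BASE_URL}/country/{quote(country)}"
        f"/indicator/{quote(indicator)}?format=json"
    )


def to_rows(payload):
    """Flatten an assumed [metadata, records] JSON response into simple dicts."""
    _metadata, records = json.loads(payload)
    return [
        {
            "country": rec["country"]["value"],
            "year": int(rec["date"]),
            "value": rec["value"],
        }
        for rec in records
    ]


# Placeholder response with invented values, standing in for a network call.
SAMPLE = json.dumps([
    {"page": 1, "pages": 1},
    [
        {"country": {"id": "XX", "value": "Exampleland"}, "date": "2015", "value": 2.0},
        {"country": {"id": "XX", "value": "Exampleland"}, "date": "2014", "value": 1.0},
    ],
])

rows = to_rows(SAMPLE)
for row in rows:
    print(row)
```

Once the data is in rows like these, it can be handed to whatever analysis tool you prefer, which is all a wrapper really does for you.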

Already, these different modes of publishing the data support different sorts of users with different sorts of skills – and different training needs.

Users of the portal’s own tools are faced with portal-specific concerns – “what does this indicator actually refer to?”, for example – as well as the need for more generic data skills, such as being able to read a chart. If a picture saves a thousand words, in the case of a chart, what would those thousand words be, and how many of us actually spend the time to “read” the equivalent of a thousand words of description from a chart? Even a simple bar chart can be read in many ways, depending on features as simple as how the bars are sorted, or grouped.
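To see how much sorting alone can change what a chart appears to say, here is a minimal sketch that draws the same values as a text bar chart in two different orders. The category names and numbers are invented purely for illustration, and `text_bar_chart` is a hypothetical helper rather than part of any portal's tooling.

```python
# Hypothetical values, invented for illustration only.
values = {"Region A": 7, "Region B": 3, "Region C": 9, "Region D": 5}


def text_bar_chart(items):
    """Return one line per (label, value) pair, drawing each bar with '#'."""
    return [f"{label:<10}{'#' * value} {value}" for label, value in items]


# Alphabetical order makes it easy to look up a single category.
by_label = sorted(values.items())

# Descending order by value makes the ranking the story.
by_value = sorted(values.items(), key=lambda item: item[1], reverse=True)

print("\n".join(text_bar_chart(by_label)))
print()
print("\n".join(text_bar_chart(by_value)))
```

The data is identical in both charts; only the second one invites the reader to ask who comes first and who comes last.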

This, then, is the first issue we need to address: improving basic levels of literacy in interpreting – and manipulating (for example, sorting and grouping) – simple tables and charts. Sensemaking, in other words: what does the chart you’ve just produced actually say? What story does it tell? And there’s an added benefit that arises from learning to read and critique charts better – it makes you better at creating your own.

Associated with reading stories from data comes the reason for telling the story and putting the data to work. How does “data” help you make a decision, or track the impact of a particular intervention? (Your original question should also have informed the data you searched for in the first place.) Here we have a need to develop basic skills in how to actually use data, from finding anomalies in order to hold publishers to account, to using the data as part of a positive advocacy campaign.

After a quick read, on site, of some of the stories the data might have to tell, there may be a need to do further analysis, or more elaborate visualization work. At this point, a range of technical craft skills often come into play, as well as statistical knowledge.

Many openly published datasets just aren’t that good – they’re “dirty”, full of misspellings, missing data, things in the wrong place or wrong format, even if the data they do contain is true. A significant amount of time that should be spent analyzing the data gets spent trying to clean the dataset and get it into a form where it can be worked with. I would argue that a data technician, with a wealth of craft knowledge about how to repair what is essentially a broken dataset, can play an important timesaving role here, getting data into a state where an analyst can actually start to do their job of analyzing it.
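As a small illustration of the kind of repair work involved, here is a sketch that tidies a tiny, invented CSV sample showing some typical problems: stray whitespace, inconsistent capitalisation, a missing value and a thousands separator. The column names and the `clean_row` helper are assumptions made for the example; real datasets will need their own rules.

```python
import csv
import io

# A tiny invented sample with typical problems: stray whitespace,
# inconsistent capitalisation, a missing value and a thousands separator.
RAW = """country,year,value
 Spain ,2014,"1,200"
spain,2015,
FRANCE,2014,950
"""


def clean_row(row):
    """Normalise one row; the value becomes None when it is missing."""
    country = row["country"].strip().title()
    year = int(row["year"])
    raw_value = row["value"].strip().replace(",", "")
    value = float(raw_value) if raw_value else None
    return {"country": country, "year": year, "value": value}


rows = [clean_row(r) for r in csv.DictReader(io.StringIO(RAW))]
for r in rows:
    print(r)
```

Note that the missing value is kept as an explicit gap rather than silently turned into zero, so that whoever analyses the data next can decide how to treat it.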

But at the same time, there is a range of tools and techniques that can help the everyday user improve the quality of their data. Many of these tools require an element of programming knowledge, but less than you might at first think. In the Open University/FutureLearn MOOC “Learn to Code for Data Analysis” we use an interactive notebook style of computing to show how you can use code literally one line at a time to perform powerful data cleaning, analysis, and visualization operations on a range of open datasets, including data from the World Bank and Comtrade.

Here, then, is yet another area where skills development may be required: statistical literacy. At its heart, statistics simply provides us with a range of tools for comparing sets of numbers. But knowing what comparisons to make, or the basis on which particular comparisons can be made, and knowing what can be said about those comparisons or how they might be interpreted – in short, understanding what story the stats appear to be telling – can quickly become bewildering. Just as we need to improve sensemaking skills associated with reading charts, so too we need to develop skills in making sense of statistics, even if we are not actually producing those statistics ourselves.
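A minimal sketch shows how the choice of comparison can change the story. The two groups of numbers below are invented for illustration, with one deliberately extreme value in the second group; the example uses only the mean and median functions from Python's standard `statistics` module.

```python
from statistics import mean, median

# Invented values for two groups; the last value in group_b is an outlier.
group_a = [10, 12, 11, 13, 12]
group_b = [9, 10, 11, 10, 40]

summary = {
    name: {"mean": mean(data), "median": median(data)}
    for name, data in {"A": group_a, "B": group_b}.items()
}

for name, stats in summary.items():
    print(name, stats)
```

Compared by mean, group B looks larger; compared by median, group A does. Neither statistic is wrong, but each tells a different story, and knowing which one suits the question is exactly the kind of literacy at issue.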

As more data gets published, there are more opportunities for more people to make use of that data. In many cases, what’s likely to hold back that final data use is a skills gap: chief among the missing skills are those required to interpret simple datasets and the statistics associated with them, together with the knowledge of how to make decisions or track progress based on that interpretation. However, the path from the originally published open dataset to the statistics or visualizations used by end-users may also be a winding one, requiring skills not only in analyzing data and uncovering – and then telling – the stories it contains, but also in more mundane technical and operational concerns such as actually accessing, and cleaning, dirty datasets.