Trying to find useful things to do with emerging technologies in open education and data journalism

Reshaping Horse Import/Export Data to Fit a Sankey Diagram

As the food labeling and substituted horsemeat saga rolls on, I’ve been surprised at how little use has been made of “data” to put the structure of the food chain into some sort of context* (or maybe I’ve just missed those stories?). One place that can almost always be guaranteed to post a few related datasets is the Guardian Datastore, who use EU horse import/export data to produce interactive map of the European trade in horsemeat

*One for the to do list – a round up of “#ddj” stories around the episode.)

(The article describes the source of the data as the Eurpoean Union Unistat statistics website, although I couldn’t find a way of recreating the Guardian spreadsheet from that source. When I asked Simon Rogers how he’d come by the data, he suggested putting questions into the Eurostat press office;-)

The data published by the Guardian datastore is a matrix showing the number of horse imports/exports between EU member countries (as well as major traders outside the EU) in 2012:

One way of viewing this data structure is as an edge weighted adjacency matrix that describes a graph (a network) in which the member countries are nodes and the cells in the matrix define edge weights between country nodes. The weighted edges are also directed, signifying the flow of animals from one country to another.

Thinking about trade as flow suggests a variety of different visualisation types that build on the metaphor of flow, such as a Sankey diagram. In a Sankey diagram, edges of different thicknesses connect different nodes, with the edge thickness dependent on the amount of “stuff” flowing through that connection. (The Guardan map above also uses edge thickness to identify trade volumes.) Here’s an example of a Sankey diagram I created around the horse export data:

(The layout is a little rough and ready – I was more interested in finding a recipe for creating the base graphic – sans design tweaks;-) – from the data as supplied.)

So how did I get to the diagram from the data?

As already mentioned, the data came supplied as an adjacency matrix. The Sankey diagram depicted above was generated by passing data in an appropriate form to the Sankey diagram plugin to Mike Bostock’s d3.js graphics library. The plugin requires data in a JSON data format that describes a graph. I happen to know that that the Python networkx library can generate an appropriate data object from a graph modeled using networkx, so I know that if I can generate a graph in networkx I can create a basic Sankey diagram “for free”.

So how can we create the graph from the data?

The networkx documentation describes a method – read_weighted_edgelist – for reading in a weighted adjacency matrix from a text file, and creating a network from it. If I used this to read the data in, I would get a directed network with edges going into and out of country nodes showing the number of imports and exports. However, I wanted to create a diagram in which the “import to” and “export from” nodes were distinct so that exports could be seen to flow across the diagram. The approach I took was to transform the two-dimensional adjacency matrix into a weighted edge list in which each row has three columns: exporting country, importing country, amount.

So how can we do that?

One way is to use R. Cutting and pasting the export data of interest from the spreadsheet and into a text file (adding in the missing first column header as I did so) gives a source data file that looks something like this:

In contrast, the edge list looks something like this:

So how do we get from one to the other?

Here’s the R script I used – it reads the file in, does a bit of fiddling to remove commas from the numbers and turn the result into integer based numbers, and then uses the melt function from the reshape library to generate the edge list, finally filtering out edges where there were no exports:

(Another way of getting a directed, weighted edge list from an adjacency table might be to import it into networkx from the weighted adjacency matrix and then export it as weighted edge list. R also has graph libraries available, such as igraph, that can do similar things. But then, I wouldn’t have go to show the “melt” method to reshaping data;-)

Having got the data, I now use a Python script to generate a network, and then export the required JSON representation for use by the d3js Sankey plugin:

#python code
import networkx as nx
import StringIO
import csv
#Bring in the edge list explicitly
#rawdata = '''"SLOVENIA","AUSTRIA",1200
#"AUSTRIA","BELGIUM",134600
#"BULGARIA","BELGIUM",181900
#"CYPRUS","BELGIUM",200600
#... etc
#"ITALY","UNITED KINGDOM",12800
#"POLAND","UNITED KINGDOM",129100'''
#We convert the rawdata string into a filestream
f = StringIO.StringIO(rawdata)
#..and then read it in as if it were a CSV file..
reader = csv.reader(f, delimiter=',')
def gNodeAdd(DG,nodelist,name):
node=len(nodelist)
DG.add_node(node,name=name)
#DG.add_node(node,name=name)
nodelist.append(name)
return DG,nodelist
nodelist=[]
DG = nx.DiGraph()
#Here's where we build the graph
for item in reader:
#Even though import and export countries have the same name, we create a unique version depending on
# whether the country is the importer or the exporter.
importTo=item[0]+'.'
exportFrom=item[1]
amount=item[2]
if importTo not in nodelist:
DG,nodelist=gNodeAdd(DG,nodelist,importTo)
if exportFrom not in nodelist:
DG,nodelist=gNodeAdd(DG,nodelist,exportFrom)
DG.add_edge(nodelist.index(exportFrom),nodelist.index(importTo),value=amount)
json = json.dumps(json_graph.node_link_data(DG))
#The "json" serialisation can then be passed to a d3js containing web page...

What this recipe shows is how we can chain together several different tools and techniques (Google spreadsheets, R, Python, d3.js) to create a visualisation with too much effort (honestly!). Each step is actually quite simple, and with practice can be achieved quite quickly. The trick to producing the visualisation becomes one of decomposing the problem, trying to find a path from the format the data is in to start with, to a form in which it can be passed directly to a visualisation tool such as the d3js Sankey plugin.

PS In passing, as well as the data tables that can be searched on Eurostat, I also found the Eurostat Yearbook, which (for the most recent release at least), includes data tables relating to reported items:

So it seems that the more I look, the more and more places seems to making data that appears in reports available as data…

@Bruce Thanks for the comment and for sharing that link – I think it’s really useful being able to demonstrate a variety of ways of getting these graphics out, given folk have different tool preferences sometimes can’t quite see how a technique using one tool may be relevant to them and their preferred choice of tool.

If you are interested in more agricultural related statistic such as trade, production and resources, you can find them on FAOSTAT(http://faostat3.fao.org/home/index.html) which has an accompany R package FAOSTAT to download the trade data in both adjacency matrix and edge list form to save you time (Unfortunately the trade data are currently being reviewed and updated so it will not be available for several weeks.). We have trade data for 572 commodities for most member countries of FAO starting from 1960 if available.

We also have a statistical yearbook produced entirely from R and LaTeX, any feedbacks on the book and the package are greatly appreciated.

Thanks for that tip Michael – will take a look. One thing that may promote the “worked use” of data is to see it being published in packages that play with data handling power tools, such as R. (For all its widespread use, Excel doesn’t really do it for me. Spreadsheets are too disorganised and too easy to mess up, as for example documented by EuSpRIG (http://www.eusprig.org/), the European Spreadsheet Risks Interest Group …)

@Michael I’ve just been having a quick look at FAOstat3 site, and can find pathways to create custom XLS data downloads – but I don’t see any R packages? (Or are they under revision?)

Looking round the site, it also got me wondering about the utility of delivering that sort of functionality in eg a Shiny [ http://www.rstudio.com/shiny/ ] application (with the data maybe being pulled in from the online database?) buit on top of/around an FAOstat library?

There are no analysis case study, just a demonstration of how the package allow the user to find and download data from FAOSTAT. But I do have plans to make use of it and release some basic analysis weekly to demonstrate the use and publicize these data on my blog.

The package was only released 20 days ago thus the link to the package and the yearbook are still under way.

We used to have web application under the “analysis” tab but it is under revision (We are quite ambitious, so a lot of things are under revision and update to make it better =P), thanks for the suggestion and I think shiny is a good idea as well!