Search This Blog

OpenRefine – an experiment in data cleaning

In a recent blog post
on Northern Ireland’s Renewal Heat Incentive (RHI) scandal [here]
I spent quite a bit of time recording all of the changes, tweaks, and decisions
I had to make to get the data into a usable format. With any dataset it is
important to understand the transformations that went into bringing it to its
final form. If other researchers are unable to follow your process and
consistently achieve the same results from the same dataset it brings your
analysis into question. Beyond that, it brings the whole endeavour of data
science and data analysis into disrepute. If you can’t rely on the figures to
tell a consistent story, you can’t make consistent decisions, and you can’t gain
reliable insights. You certainly can’t trust the folks who are furnishing you
this flawed and unreliable nonsense. If you can’t rely on the information
you’re seeing on your dashboard, what is it other than a collection of
interesting, but meaningless, colours and shapes?

While this should be a
consideration for any dataset you look at, it’s particularly apposite in this
instance. Even leaving aside the fact that the beneficiaries of the botched
scheme went to the High Court to try and suppress it, it’s a dataset that
essentially brought down the Northern Ireland Assembly, redrew the political
map here, and will have long-lasting impacts on how we are governed.

The other thing that was
on my mind was the fact that the Department of the Economy’s website said that
they were preparing a similar dataset of the private individuals who received
money from the scheme and that both lists would probably be updated on a
six-monthly basis. If there is sustained interest in this scandal (and the
dataset), I’ll probably be quite keen to keep the dashboard up to date. The
problem here is, of course, that I’m peculiarly lazy and the thought of
endlessly repeating all of my carefully outlined steps with each new release of
data just fills me with dread.

There is a whole
industry dedicated to data cleansing, and the brightest light in this firmament
is Alteryx. I’ve seen these guys demo
their product on a few occasions and it is simply stunning. Their website is
littered with the logos of all the huge, instantly-recognisable corporations
they serve. Understandably, their fee structure is commensurate to the work
they do. Equally understandable is the fact that this is way beyond the financial
reach of a blogger like me, so I need to find something else. I’ve been looking
around at a number of more affordable options when I happened to fall upon OpenRefine (formerly Google Refine). You can
find out all about it on their web page [here]
where you can buy a book and watch some training videos. My initial reaction to
the product is that it is really powerful and probably would require a significant
expenditure of time and mental energy to get the fullest out of it. That said,
I watched the three training videos on their site twice (the second time,
taking notes). That has been the entirety of my training with the product. It
was still sufficient to work through the list of changes I’d made to the
original dataset and in about the same amount of time it originally took to
complete. I’d particularly note the automated clustering feature that
independently noted a number of the near identical Business names I’d spotted
manually. It also discovered a couple that I’d missed the first time round.

But here’s the good bit
– it’s not just done for now and the next time the Department make an updated
version I’ve got to do it all again. OpenRefine allows the user to export the
actions taken as a JSON script. It not only allows you to run through the
process you created in a matter of seconds, it is easily publishable and can be
reviewed by other researchers to ensure consistent, robust datasets and analyses.

It is, of course, early
days, but I already feel that OpenRefine will become a permanent fixture of my
data analysis. One thing I am certain of is that when the Department of the
Economy publishes the next dataset of RHI beneficiaries, I’ll be ready, knowing
that my inherent laziness is no impediment to decent quality data analysis!

If you want to
replicate my analysis or error check my process, I’ve made the JSON file
available [here][... a little time passes ...]The above blog has been sitting on the server for about a week or so and I've been toying with the idea as to when would be the best time to publish it. The thing that has been on my mind is that it's lovely (if a but gushy) and it doesn't have any follow up in terms of how repeatable the process is. Well ... I was browsing the internet yesterday afternoon and happened upon the Department of the Environment's website [here] (Yep! I'm totally Rock 'n' Roll). They note that the previously available list had a number of errors that they've now fixed and that the list has been republished. In their accompanying list of errata [here], it's clear that these are not big changes - Green Energy Engineering had seven boilers, not nine; Stephens Catering Co Ltd had only one boiler, not five; and two companies with three boilers were added to the list. While there are significant amounts of money involved, the changes to the dataset aren't huge. This is also the situation I had dreaded - the idea of starting over from scratch having to make the same edits and changes over and over for little reward ... it was just! so! dull!My first instinct was to simply go to the excel sheet and do the edits manually, but then I thought: "OpenRefine! Let's see how well this works!" ... I downloaded the PDF (come on Dept of Economy! ... it data - publish it as a spreadsheet ... how is this difficult to understand?), converted it to Excel (thank you Smallpdf), started OpenRefine, opened the sheet, found the document where I'd saved off the JSON of the process, hit Apply > Perform Operations. Boom! it was done! Two hours of work last week reduced down to less time than it takes to type this sentence! ... Actually, it was done in less time than it takes to read this sentence! After that it was simply a case of exporting it out & badda bing, badda boom updated Tableau dashboard. Seriously, what is not to like?The original blog post is [here] and the updated dashboard is below (or on TableauPublic if you have any difficulties [here]). You won't notice much change in the actual data (the scheme has already cost us over £27m and the number of boilers has only dropped from 869 to 866), and you certainly won't notice how easy OpenRefine made the process ... and that's as it should be!

Comments

Post a Comment

Popular posts from this blog

[Note: the dashboard is
best viewed in Full Screen Mode (F11 or icon at bottom right)] Having attended the rather excellent Open
Data Camp 5, held in Belfast over the weekend of 21-22 October 2017, I have
returned to OpenDataNI’s
datasets with renewed vigour and fervour. The first of these I’ve turned my
attention to is the recently published Emergency
Care Waiting Times dataset. The Dataset The dataset is just
over 2100 lines, representing monthly entries per Northern Irish hospital and
running from April 2008 to June 2017. The dataset gives the numbers of patients
who waited up to four hours, five to twelve hours, and over twelve hours. A further
column gives the total number of patients in these three categories. As noted, the
hospital name is given for each entry, and this is supplemented by the NHS
Trust they work under and the Type of Emergency Department (1-3) provided. A further column giving the Financial Year is
provided, but was not used. Processing The dataset didn’t require
a…

Screenshot of the Tableau Dashboard. Available [here] and at the end of this post. (Updated: see notes at end)

After much legal
wrangling and foot-dragging, the Northern Ireland Department of the Economy
have finally published a partial list of recipients of money from the botched Renewable Heat Initiative scheme. At present only limited companies and limited liability partnerships who
received in excess of £5,000 (cumulative) are listed. The data runs from the
start of the scheme to 28 February 2017. After the list was published (16 March
2017) a number of people complained that they should treated as individuals,
and not as limited companies. These corrections were made and a second list was
issued the same afternoon. The dataset used here is based on this second list. The first thing I want
to note about the document made available by the Department of the Economy is
that it is presented as a PDF. This is data! To analyse data you need it in a
suitable format such as .xls or .csv. A P…

I first encountered the term ‘Dogfooding’ or ‘to eat your
own dog food’ last year at the 2016 Tableau conference in Austin, TX. I was
attending a session on how Tableau use their own product to visualise their HR
data. The presenter uttered the line ‘We firmly believe in eating our own dog
food’. The meaning was immediately clear to me – if it’s not good enough for
internal usage, it’s hard to make the case that others should invest in the
product. Literally: if it’s not good enough for me, it’s not good enough for my
dog. Great … I get the idea fully & clearly … Unfortunately, I also have a
strong smell-memory of dog food that’s triggered every time I hear the term.
I’m not joking! As I’m writing this the air is heavy with the imagined scent of
tinned dog food … it’s not pleasant! Once the Tableau presentation ended I was free of the term – while
it may have some currency in the US, it seems to be largely absent from the UK
tech vocabulary. I was free!