Open Data, or Opacity through Transparency
Published on June 20, 2018

In recent years, governments and other public bodies have increasingly joined
the Open Data movement: releasing part of the data they collect under an open
license so that third parties can use it for any purpose. Most of these
licenses have very permissive terms, allowing redistribution and placing no
restrictions on commercial use. The motivation to release the data varies:
some datasets are useful to other organizations or users, or simply make a
nice collection that there is no point in keeping jealously hidden (for
instance, Open Data Madrid has a complete dataset of benches across the city,
including coordinates and model). When a government or other public
institution is involved, however, the motivation is usually the same: the data
represents something that belongs to, or heavily influences, the common good,
and releasing it as Open Data is most often framed as an exercise in
transparency, allowing anyone to explore the data freely and third parties to
audit it.

This can be seen as utopian or even fallacious: even though most people
nowadays own some kind of computing device, not everyone is a data analyst.
Not everyone can read, process and draw conclusions from a raw data file like
a CSV. For that reason, visualizations of the data, such as maps or other
animations, are usually created for informative purposes. However, this work
is rarely motivated by anything other than volunteering: students and data
practitioners usually produce it in their free time, trying to improve their
craft or to share knowledge in an easier-to-digest format.

This is closely related to the main point of this post: contributors using
Open Data are most likely not being paid to do so, and if they are, their work
is probably destined for a commercial product rather than to be shared under
an open license. People trying to practice and improve are mostly just looking
for data to work with. If the data is malformed, uncomfortable to use, or
obfuscated, the odds are the contributor will not be stubborn about it: they
will ignore the dataset and search for a new one.

This is the central claim of this article: Open Data should be treated with
special care, so that people can easily work with it. Fortunately, I have an
entertaining case study to support this claim (entertaining for you, because
you are going to watch me suffer).

The Story: Air Quality in Madrid

In my company in general, and in my unit in particular, we are constantly
trying new techniques for data processing, prescription and prediction. To do
so, we may build proofs of concept, and as with any work in data science or
analysis, we need actual data. Sometimes we already have it; other times we
rely on open datasets such as those on Kaggle. A week ago, however, we decided
to search the Madrid Open Data portal for a dataset that matched our needs.
Open Data would let us experiment knowing that the resulting proof of concept
could later be shared as a blog post, at a meetup, and so on. Also, for
further reference in this post, my language of choice for this task is
Python 3.

Soon enough, we hit the jackpot: one of the datasets featured hourly levels of
pollution in Madrid over the last 18 years. Reading the description, we can
expect to find not one but several 18-year-long hourly time series: it
explicitly states that measurements from several stations are collected. A
related dataset provides information about the stations themselves to enrich
the analysis, and if locations are included, the possibilities are great. At
first glance, the website says the file is Excel's very own .xls. It makes my
eyes roll every time, but pandas can deal with it, so let's leave it for now.

There is a description of the file available, but I like to take a look at the
data myself before reading the column descriptions and such. There are 18 zip
files available (one per year), each containing 12 files (one per month). The
files themselves have a .txt extension, but they probably hold some kind of
tabular format. Let's open a random one (the ellipses mean the lines are
truncated; they are long):

OK, so there are no headers to name the columns, but we can see that it is a
dot-separated text containing… Wait, what is 09V19? Those are not regular
numbers; is V the separator? That is not something I have seen before, but
maybe the dot is actually a decimal point (which would make sense). Yet the
first field looks like an ID of some sort that suddenly turns into a
floating-point value… It is time to open the data description.

In the PDF we find that it is actually fixed-width text, which explains why
all those V characters are vertically aligned. In Spanish, it says that:

The first 8 digits are the station ID,

the next 2 digits are the "parameter" (the reader is invited to look at
Annex 2),

the next 2 digits are the technique,

the next 2 digits are the analysis period (02 meaning hourly, and since we
have downloaded hourly data this column will always be 02),

the date follows in YYMMDD format,

and then come 24 measurements, each consisting of a 5-digit level plus a
1-character validation flag.
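As a sanity check, the layout above can be parsed directly with
pandas.read_fwf by spelling out the column boundaries. The row, station ID and
values below are invented for illustration; only the field widths come from
the description:

```python
import io

import pandas as pd

# Synthetic one-day row following the documented layout: 8-char station ID,
# 2-char parameter (gas), 2-char technique, 2-char period, YYMMDD date,
# then 24 measurements of 5 digits plus a 1-char validation flag.
row = "28079004" + "08" + "08" + "02" + "180101" + "00042V" * 24

# Fixed widths, so read_fwf with explicit colspecs (start, end) per field.
colspecs = [(0, 8), (8, 10), (10, 12), (12, 14), (14, 20)]
names = ["station", "parameter", "technique", "period", "date"]
for hour in range(24):
    start = 20 + 6 * hour
    colspecs += [(start, start + 5), (start + 5, start + 6)]
    names += [f"value_{hour:02d}", f"valid_{hour:02d}"]

df = pd.read_fwf(io.StringIO(row), colspecs=colspecs, names=names, dtype=str)
```

Reading everything as strings first avoids losing leading zeros in the IDs;
numeric conversion can happen after the validation flags are checked.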

So, contrary to what we might have expected, each row is a day and has 24
columns, one per hour. That is a… weird way to present the data (anyone who
needs that layout probably knows how to use a pivot table), and inconvenient
for us. Funnily enough, the daily data is even weirder, following the same
pattern: each row is a month and each column is a day, which means that not
all rows have the same number of columns. We also have a technique and a
"parameter" column, the latter turning out to be the gas that row is
measuring. A look at the annex shows that one of these columns is redundant,
since each gas is only measured with a single technique, so we can discard it.
All those V characters scattered through the text turned out to be "validation
characters" indicating that a measurement is valid. The description explicitly
states that only V-marked values are valid (so why include the rest at all?),
so we will get rid of everything else.

All this processing involves some functions that are far from common in Python
and, specifically, in pandas: this was the first time I had heard of
pandas.read_fwf(), which reads fixed-width text files, and also my first time
using pandas.melt(), which basically "unpivots" columns and which I needed to
turn hours into rows. After doing so, I had to reconstruct the date to
generate correct timestamps, and I decided to pivot the measured gases into
columns (so that each column is a gas), which makes it easier to extract
sequential series by simply selecting a column instead of querying rows. It
also allows meaningful column names, extracted from the abbreviations in the
annex.
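The reshaping steps can be sketched on a toy frame. Everything here is
invented for illustration (column names like value_00 are my own, not the
dataset's), but the melt, timestamp reconstruction and pivot mirror the
process described above:

```python
import pandas as pd

# Toy frame in the shape described: one row per station/gas/day,
# one column per hour (only 2 hours shown). Values are made up.
df = pd.DataFrame({
    "station": ["28079004", "28079004"],
    "gas": ["NO2", "CO"],
    "date": ["2018-01-01", "2018-01-01"],
    "value_00": [42.0, 0.5],
    "value_01": [40.0, 0.4],
})

# Unpivot the hour columns into rows.
long = df.melt(id_vars=["station", "gas", "date"],
               var_name="hour", value_name="value")
long["hour"] = long["hour"].str.slice(-2).astype(int)

# Reconstruct a proper timestamp from the date plus the hour offset.
long["timestamp"] = (pd.to_datetime(long["date"])
                     + pd.to_timedelta(long["hour"], unit="h"))

# Pivot so each gas becomes a column, indexed by station and timestamp.
wide = long.pivot_table(index=["station", "timestamp"],
                        columns="gas", values="value")
```

Selecting a sequential series for one gas is now just a column lookup.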

This worked on a single file, but I needed to reconstruct the whole dataset
from 208 files. The names do not follow any convention at all, and the files
have to be carefully selected when reading, because some folders include XML
or CSV files that would break the script. Two different parsing methods have
to be implemented, because the most recent years are actually comma-separated,
with the columns divided differently (the station ID is split into 3 different
fields). Some fields are malformed, so a ValueError exception should be
expected and ignored. While doing this, measurements of a gas that is not
mentioned anywhere in the description (58) appear in certain years, so those
have to be dropped as well.
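The file selection and error tolerance can be sketched roughly as follows.
File names and the sample values are hypothetical; only the .txt filtering,
the ValueError handling and the dropping of gas "58" come from the process
above:

```python
def parse_level(raw):
    """Parse a 5-digit level field; some years contain malformed fields,
    so return None instead of raising."""
    try:
        return int(raw)
    except ValueError:
        return None


def select_members(filenames):
    # Some yearly folders also ship XML or CSV exports that would break
    # the fixed-width parser, so only the .txt members are read.
    return [name for name in filenames if name.endswith(".txt")]


# Hypothetical archive listing and field values, for illustration only.
members = select_members(["ene_mo18.txt", "datos.xml", "feb_mo18.txt"])
levels = [parse_level(x) for x in ["00042", "000N4", "00017"]]
kept = [lv for lv in levels if lv is not None]

# The undocumented gas "58" would be dropped at this stage too, e.g.:
# df = df[df["parameter"] != "58"]
```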

Once this is achieved, the resulting DataFrame is half a gigabyte. For
convenience, I decided to store it in an HDF5 file. This format is
hierarchical, compressed and absurdly fast when accessing sequential rows (a
very common access pattern when working with time series). I split the data by
station, giving each station its very own dataset inside the file. This lets
the user select all the sequential data of a station, while letting us discard
all entries from inactive periods and all columns for gases the station does
not measure (and which are therefore completely empty). This little bit of
reflection and savoir faire results in a single, hierarchical and convenient
70 MB file (7 times smaller than the CSV equivalent, 3 times smaller than the
original data).
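A minimal sketch of the per-station split, with invented values and a key
naming convention (station_<id>) of my own; each chunk would then be written
to its own key with to_hdf:

```python
import pandas as pd

# Toy data: station 28079008 never measures CO, so its CO column is empty.
df = pd.DataFrame({
    "station": ["28079004", "28079004", "28079008"],
    "NO2": [42.0, 40.0, 55.0],
    "CO": [0.5, 0.4, None],
})

per_station = {}
for station, group in df.groupby("station"):
    # Drop columns for gases this station never measures (all-NaN).
    per_station[station] = (group.drop(columns=["station"])
                                 .dropna(axis=1, how="all"))

# Each chunk would then get its own dataset inside the HDF5 file, e.g.:
# per_station["28079004"].to_hdf("madrid.h5", key="station_28079004",
#                                complevel=9)
```

Reading back a single station is then a single sequential read of one key.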

And we still have to check the stations' Excel file. Once we open it, we get
what we expected: a heavily formatted XLS file, which we basically need to
copy-paste without formatting into a new sheet before reading. However, some
deeper problems appear on closer inspection: the coordinates are provided in
degrees-minutes-seconds format instead of decimal (which makes them unusable
as-is and requires parsing and conversion), and the IDs are different. With a
bit of imagination and detective work, it is easy to infer that these IDs are
just the last two digits of the other set of IDs (the rest being constant
codes for the region). But here lies the real problem: not even half of the
stations referenced in the data have this extended information. This is
because some of them have "ceased activity", and no coordinates are provided
for them.
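The coordinate fix boils down to a standard degrees-minutes-seconds
conversion. A small sketch with invented values (the coordinates below are
not from the dataset):

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees-minutes-seconds to decimal degrees; west and
    south hemispheres get a negative sign."""
    sign = -1 if hemisphere in ("S", "W") else 1
    return sign * (degrees + minutes / 60 + seconds / 3600)


# Example: 3 deg 42' 36'' W, a made-up longitude near Madrid.
lon = dms_to_decimal(3, 42, 36, "W")

# The short station IDs appear to be the last two digits of the long ones,
# the prefix being a constant region code (an inference, not documented).
full_id = "28079004"
short_id = full_id[-2:]
```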

The Open Secret

This took me a whole day at work to clean and understand, a task most people
would have given up on. Even after all the hard work, there are some
disappointing gotchas, like the missing information about the old stations.

However, it is not as if the city council were unaware of this; it is probably
quite the contrary. On the last page of the aforementioned data description
document, we find a comprehensive explanation of how to actually open the
files with Excel. Spoiler alert: it takes 16 steps. This is probably a fair
indicator that they do not use this data themselves, or at least that they do
not work with it the way they expect you to. The current city council is
probably not the (only) one to blame: all this inconvenience and accidental
obfuscation probably has a reason to exist.

A short data exploration reveals some interesting details about the stations.
We can check for null values over the whole period and view them side by side,
to get a better perspective on each station's activity. Using the wonderful
missingno library, we get a fast and easy-to-understand visualization of when
data is present at each station.
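Roughly the same presence information that missingno's matrix draws can also
be tabulated with plain pandas. A sketch with two invented stations, one of
which ceases activity mid-2009:

```python
import pandas as pd

# Invented daily index and stations, for illustration only.
idx = pd.date_range("2008-01-01", "2010-12-31", freq="D")
df = pd.DataFrame(index=idx)
df["station_a"] = 1.0                      # active the whole period
df["station_b"] = 1.0
df.loc["2009-07-01":, "station_b"] = None  # ceases activity mid-2009

# Share of non-null days per station and year: gaps stand out immediately.
presence = df.notna().resample("YS").mean()
```

With the real data, one column per station makes the 2009 break in activity
visible at a glance.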

The resulting image represents data with grey blocks and missing data as gaps,
and an obvious pattern emerges. It seems that in 2009 almost half of the
stations ceased all activity, and a good portion of the others started up.
Only 6 stations span the whole period, and those I merged manually (their IDs
were renamed and they appear under separate IDs in the data, but the
explanation explicitly states that they are the same stations).

This may sound familiar if you, dear reader, live in Madrid. In 2009, Madrid
was breaking all pollution records and enduring fierce criticism for its
inaction: no regulation of the most polluting vehicles in the city center and
a lack of green spaces were casting doubt on the city's sustainability. That
same year, the EU guidelines went from friendly recommendations to binding
limits, and most of those limits were not being met.

The city council's reaction was interesting, to say the least: instead of
taking action against the pollution, their efforts focused on masking the
results. In late 2009, most of the stations were moved to different locations
or simply closed. One of the closed stations had registered 74 µg/m³ of NO2,
when the limit recommended by the EU is 42 µg/m³; 18 out of 24 stations
registered levels above this limit. The stations were moved to more propitious
locations: those located at crossings and roundabouts were relocated to
greener areas, which register lower measurements when levels peak. Even most
of the surviving stations changed which substances they measured, something
that can be seen in the data. Are particles smaller than 10 µm (PM10) a
problem? Well, then stop measuring them altogether.

The Lesson: Exercise Skepticism

In 2010, Madrid lowered its reported pollution levels by 26% compared with the
previous year, and that change is visible in the data. The problem is that
anyone not paying enough attention can attribute it to the 8% drop in fuel
consumption (a consequence of the financial crisis, not of any official
measure) or to the especially unstable weather that year (26% more rain than
average). Only a closer look at the data, the kind of closer look that nobody
following 16 steps in Excel is likely to take, reveals the truth. And the
truth is that the levels probably remained intact; only the way they were
measured changed. Although a shift can be seen in the last year (retired
measurements were added back to the stations), this dataset is proof of the
deceitful tactics used to game the data and the EU measurements.

And this is an important lesson to take into account: open data, just like
open source, relies on contributors to audit and verify it. Interacting with
and exploring data is, however, easier to obfuscate: with large amounts of
data, flaws and irregularities can be hidden simply by making the files
inconvenient to work with. For that reason, it is not enough to demand open
data on sensitive matters that affect the common good; the quality and
accessibility of that data are just as important. Open data should be
straightforward, and should not rely on almost-steganographic designs to hide
flaws in plain sight.

As a final note, the cleaned-up data is available on Kaggle under the name
Air Quality in Madrid (2001-2018) for further exploration, in an attempt to
win back potential contributors. Even though the change in trend is dishonest,
the data is still interesting and can be useful for all kinds of time series
analysis or prediction.