Category: opendata

Recently I have been named the new chairman of the board of the Open State Foundation. This is a new role I am tremendously looking forward to taking up. The Open State Foundation is the leading Dutch NGO concerning government transparency, and over the past years they’ve pursued open data and government transparency persistently and in a very principled way, while also working constructively with government bodies to help them do better. Stef van Grieken, the chairman stepping down, has led the Open State Foundation board since it came into existence. The Open State Foundation is the merger of two earlier NGOs: The New Voting (Het Nieuwe Stemmen) foundation, of which Stef was the founder, and the Hack the Government (Hack de Overheid) collective.

Hack de Overheid emerged from the very first Dutch open government barcamp, which James Burke, Peter Robinett and I organised in the spring of 2008. The second edition in 2009 was the first Hack de Overheid event. My first open data project that same spring was together with James Burke and Alper Çuğun, both part of Hack de Overheid then, who provided the tech savvy, while I was the interlocutor with the Ministry of the Interior, guiding the process and translating civil servant speak for the tech guys and vice versa. At the time Elsevier (a conservative weekly) published an article naming me one of the founders of Hack de Overheid, which was true in spirit, if technically incorrect.

In the past year and a half I had more direct involvement with the Open State Foundation than in the years between. Last year I did an in-depth evaluation of the effectiveness and lasting impact of the Open State Foundation in the period 2013-2017 and facilitated a discussion about their future, at the request of their director and one of their major funders. That made me appreciate their work in much richer detail than before. My company The Green Land and Open State Foundation also encounter each other on various client projects, giving me a perspective on the quality of their work and their team.

When Stef, as he’s been working in the USA for the past years, indicated he thought it time to leave the board, it coincided with me having signalled to the Open State Foundation that, if there ever was a need, I’d be happy to volunteer for the board. That moment thus came sooner than I expected. A few weeks ago Stef and I met up to discuss it, and then the most recent board meeting made it official.

Day to day the Open State Foundation is run by a very capable team and director. The board is an all-volunteer ‘hands-off’ board that helps the Open State Foundation guard its mission and maintain its status as a recognised charity in the Netherlands. I’m happy that I can help the Open State Foundation to stay committed to their goals of increasing government transparency and, as a consequence, the agency of citizens. I’m grateful to Stef, and the others who in the past decade have helped Open State Foundation become what it is now, from its humble beginnings at that barcamp in the run-down pseudo-squat of the former Volkskrant offices, now the hipster Volkshotel. I’m also thankful that I now have the renewed opportunity to meaningfully contribute to something I in a tiny way helped start a decade ago.

Last week I presented to a provincial procurement team about how to better support open data efforts. Below is what I presented and discussed.

Open data as policy instrument and the legal framework demands better procurement

Publishing open data creates new activity. It does so in two ways. It allows existing stakeholders to do more themselves or do things differently. It also allows people who could not participate before to become active as well. We’ve seen for instance how opening up provincial and national geographic data increases the independent usage of that data by local governments. We’ve also seen how the Dutch hiking association started using national geographic data to create and better document routes. To the surprise of the Cadastre a whole new area of usage appeared as well, by cultural organisations that had never requested such data before. So open data is an enabler for agency.

If as a government data holder you know this effect takes place, you can also try and achieve it deliberately. For policy domains and groups of stakeholders where you would like to see more activity, publishing data then is an instrument in for instance achieving your own policy goals. Next to regulation and financing, publishing open data is a new third policy instrument. It also happens to be the cheapest of those three to deploy.

Open data in the EU has a legal framework where over time more things are mandated. There is a right to re-use. Upon request dataholders must be able to provide machine readable data formats. In the Netherlands open standards are compulsory for government entities since 2008. Exclusive access to government data for re-use is, except for a few very strictly regulated situations, illegal.

To be able to comply with the legal framework, and to be able to actively use open data as a policy instrument, public sector bodies must pay more attention to how they acquire data, and as a consequence must pay more attention to what happens during procurement processes. If you don’t, the government entity’s data sovereignty is strongly diminished, which carries costs.

Procurement awareness needed on multiple levels

The goal is to ensure full data sovereignty. This means paying real attention to various things on different levels of abstraction around procurement.

Ensuring data is received in open standards and regular domain specific standards

Ensure when reports are received that the data used, such as for graphs and tables, are also received

Ensure when information products are received (maps, visualisations) the data used for them are also received

Ensure procurement and collaboration contracts do not preclude sharing data with third parties, apart from on grounds already mentioned as exceptions in the law on freedom of information and re-use

Ensure that when raw data is provided to service providers, that data is still available to the government entity

Ensure that when data is collected by external entities who in turn outsource the collection, all parties involved know the data falls under the decision making power of the government entity

Ensure in collaborations you do not sign away decision power over the data you contribute, you have rights to the data you collectively create, and have as little restriction as possible on the data others contribute.

What could go wrong?

Unless you always pay attention to these points, you run the risk of losing your data sovereignty. This can lead to situations where a government entity is no longer able to comply with its own legal obligations concerning data provision and transparency.

A few existing examples of what can go wrong.

A province is counting bicycle traffic through a network of sensors they deployed themselves. The data is directly transmitted to a service provider in a different country. The province can see dashboards and download reports, but has no access to the sensor data itself, and cannot download it. So a citizen requesting the data cannot be provided with it, while the service provider itself does base commercial services on that and other data it receives, having de facto exclusive access to it.

Another province is outsourcing bird inventory counting to nature preservation organisations, who in turn rely on volunteers to do the bird watching. The province pays for the effort. When it comes to sharing the data publicly, the nature preservation organisations say their volunteers actually own the data, so nothing can be publicly shared. This is untrue for multiple reasons (database rights do not apply, and because it is a paid-for effort, procurement terms unequivocally transfer any such rights, should they exist, to the province, etc.), but as the province doesn’t want to waste time on this, nor wants to get into a fight, it leaves it be, with the result that the data is not made available.

An energy network provider pools a lot of different data sources concerning energy usage in their service area from a network of collaborating entities, both private and public. They also publish a lot of open data already. As part of the national effort towards energy transition they receive many data requests from local governments, housing associations and other entities. They would like to provide data, as they see it as a way of contributing to an essential public task (energy transition), but still say no to data requests in 60% of all cases, because they can’t figure out which contractual obligations apply to which parts of the data, or cannot reconcile conflicting or ambiguous contract clauses concerning the data.

All provinces pool data concerning economic activity and the labor market in a private foundation in which also private entities participate. That foundation sells data subscriptions. Currently they also publish some open data, but if any of the provinces would like to do more, they would have to wait for full agreement. The slowest in the group would determine the actual level of transparency.

A province has outsourced the creation of a ‘heat transition atlas’, which maps the potential for moving homes away from heating systems that burn natural gas, using various alternatives. The resulting interactive website contains different data layers, but those data layers are themselves unavailable. Although there is a general list of which data sources have been used, it does not state its sources precisely, nor does it detail how the data has been transformed for the website.

In all cases the public sector data holder has put itself in a position that could have been prevented had they paid more attention at the time of procurement or at the time of entering into collaboration. All these situations can be fixed later on, but they require additional effort, time and costs to arrange, which are unnecessary if dealt with during procurement.

But we have procurement regulations already!

What about procurement regulations? We have those, so don’t they cover all this? Mostly not, it turns out.

Terms of procurement talk about rights transfer of all deliverables, but in many cases the data involved isn’t listed as a deliverable, so it is not covered by those terms.
The terms talk about transfer of database rights, but those hardly ever apply as usually the scale of data collection and structuring into a database is limited.
Concerning research there is some talk about also transferring the data concerned, but a lot of reports aren’t research but consultancy services.

In the general regulations that apply to provincial procurement, the word data is only used in the context of personal data protection, as the Dutch plural for date, and in the context of data carriers (hard drives etc). The word standards never occurs, nor do the regulations contain references to data formats (even though legal obligations exist for government entities concerning standards and data formats).

The procurement terms are neither broad enough, nor detailed enough.

How to improve the situation

So what needs to be arranged to ensure government entities arrange their data needs correctly during procurement? How to plug the holes? A few things at the very least:

Likely, when it comes to standards and formats (which may differ per domain), the only viable place is in the mandatory technical requirements in a call for tender / request for proposals.

To get the data behind graphs, tables, info products and reports, including a list of resources and transformations applied, it needs to be specified in the list of deliverables.

Collaboration contracts entered into should always have articles on sharing the data you contribute, being able to share the data resulting from the collaboration, and rules about data that others contribute.

It is important to realise that you cannot through contracts do away with any mandatory transparency, open data, or data governance aspects. Any resulting issues will mean time consuming and likely costly repair activities.

Who needs to be involved

In order to prevent the costs of repair or mitigation of consequences, there are a number of questions concerning who should be doing what, inside a government entity.

What needs to be arranged at the point of tender, who will check it?

What needs to be part of all project starts (e.g. Checklists, data paragraphs), is the project manager aware of this, and who will check it?

Who at the writing and signing of any contract will check data aspects?

Who at the time of delivery will check if data requirements are met?

What part of this is more about awareness and operations, and what needs to be done through regulation?

Our work in the next steps

We intend to assist the province involved in making sure procurement better enables data sharing from now on. Steps we are currently taking to move this forward are:

During his keynote at the Partos Innovation Festival Kenyan designer Mark Kamau mentioned that “45% of Kenya’s GDP was mobile.” That is an impressive statistic, so I wondered if I could verify it. With some public and open data, it was easy to follow up.

World Bank data pegs Kenya’s GDP in 2016 at some 72 billion USD.
Kenya’s central bank publishes monthly figures on the volume of transactions through mobile, and for September 2018 it reports 327 billion KSh, while the lowest monthly figure is February at 300 billion. With 100 KSh being roughly equivalent to 1 USD, this means the monthly transaction volume is at least 3 billion USD every month. For a year this means at least 3*12=36 billion USD, or about half of the 2016 GDP figure. An amazing volume.
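The back-of-the-envelope check above can be written out as a small sketch. It only uses the figures quoted in this post; the function name is made up for illustration, and the exchange rate is the same rough 100 KSh per USD used above, not an official rate.

```python
# Figures as quoted in the post; nothing here is fetched from a live source.
GDP_2016_USD = 72e9        # World Bank figure for Kenya's 2016 GDP
LOWEST_MONTH_KSH = 300e9   # February, the lowest monthly mobile volume
KSH_PER_USD = 100          # rough rate used in the post (an assumption)


def annual_floor_usd(monthly_ksh: float, ksh_per_usd: float) -> float:
    """Lower bound on yearly volume: twelve months at the lowest monthly figure."""
    return monthly_ksh / ksh_per_usd * 12


yearly_usd = annual_floor_usd(LOWEST_MONTH_KSH, KSH_PER_USD)
share_of_gdp = yearly_usd / GDP_2016_USD

print(f"{yearly_usd / 1e9:.0f} billion USD per year, {share_of_gdp:.0%} of 2016 GDP")
```

Because it takes the lowest month for all twelve months, the result is a floor rather than an estimate of the actual yearly volume.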

For the UNDP in Serbia, I made an overview of existing studies into the impact of open data. I’ve done something similar for the Flemish government a few years ago, so I had a good list of studies to start from. I updated that first list with more recent publications, resulting in a list of 45 studies from the past 10 years. The UNDP also asked me to suggest a measurement framework. Here’s a summary overview of some of the things I formulated in the report. I’ll start with 10 things that make measuring impact hard, and in a later post zoom in on what makes measuring impact doable.

While it is tempting to ask for a ‘killer app’ or ‘the next tech giant’ as proof of impact of open data, establishing the socio-economic impact of open data cannot depend on that. That is both because answering such a question is only possible with long-term hindsight, which doesn’t help make decisions in the here and now, and because it would ignore the diversity of types of impacts of varying sizes known to be possible with open data. Judging by the available studies and cases there are several issues that make any easy answers to the question of open data impact impossible.

1 Dealing with variety and aggregating small increments

There are different varieties of impact, in all shapes and sizes. If an individual stakeholder, such as a citizen, does a very small thing based on open data, like making a different decision on some day, how do we express that value? Can it be expressed at all? E.g. in the Netherlands the open data based rain radar is used daily by most cyclists, to see if they can get to the railway station dry, had better wait ten minutes, or should rather take the car. The impact of a decision to cycle can mean lower individual costs (no car usage), personal health benefits, economic benefits (lower traffic congestion), environmental benefits (lower emissions) etc., but is nearly impossible to quantify meaningfully in itself as a single act. Only where such decisions are stimulated, e.g. by providing open data that allows much smarter, multi-modal route planning, may aggregate effects become visible, such as reduction of traffic congestion hours in a year, general health benefits of the population, or reduction of traffic fatalities, which can be much better expressed in a monetary value to the economy.

2 Spotting new entrants, and tracking SMEs

The existing research shows that previously inactive stakeholders, and small to medium sized enterprises are better positioned to create benefits with open data. Smaller absolute improvements are of bigger value to them relatively, compared to e.g. larger corporations. Such large corporations usually overcome data access barriers with their size and capital. To them open data may even mean creating new competitive vulnerabilities at the lower end of their markets. (As a result larger corporations are more likely to say they have no problem with paying for data, as that protects market incumbents with the price of data as a barrier to entry.) This also means that establishing impacts requires simultaneously mapping new emerging stakeholders and aggregating that range of smaller impacts, which both can be hard to do (see point 1).

3 Network effects are costly to track

The research shows the presence of network effects, meaning that the impact of open data is not contained or even mostly specific to the first order of re-use of that data. Causal effects as well as second and higher order forms of re-use regularly occur and quickly become, certainly in aggregate, much higher than the value of the original form of re-use. For instance the European Space Agency (ESA) commissioned my company for a study into the impact of open satellite data for ice breakers in the Gulf of Bothnia. The direct impact for ice breakers is saving costs on helicopters and fuel, as the satellite data makes determining where the ice is thinnest much easier. But the aggregate value of the consequences of that is much higher: it creates a much higher predictability of ships and the (food) products they carry arriving in Finnish harbours, which means lower stocks are needed to ensure supply of these goods. This reverberates across the entire supply chain, saving costs in logistics and allowing lower retail prices across Finland. When mapping such higher order and network effects, every step further down the chain of causality shows that while the bandwidth of value created increases, at the same time the certainty that open data is the primary contributing factor decreases. Such studies are also time consuming and costly. It is often unrealistic to expect data holders to go to such lengths to establish impact. The ESA example mentioned is, for instance, part of a series of over 20 such case studies ESA commissioned over the course of 5 years, at considerable cost.

4 Comparison needs context

Without the context of a specific domain or a specific issue, it is hard to assess benefits and compare them with their associated costs, which is often the underlying question concerning the impact of open data: does it weigh up against the costs of open data efforts? Even though in general open data efforts shouldn’t be costly, how does some type of open data benefit compare to the costs and benefits of other actions? Such comparisons can be made in a specific context (e.g. comparing the cost and benefit of open data for route planning with other measures to fight traffic congestion, such as increasing the number of lanes on a motorway, or increasing the availability of public transport).

Because open data provisioning is a prerequisite for it having any impact, the availability of data and the maturity of open data efforts determine not only how much impact can be expected, but also what can be measured (mature impact might be measured as impact on e.g. traffic congestion hours in a year, but early impact might be measured in how the number of re-users of a data set is still steadily growing year over year).

Whether open data creates much impact is not only dependent on the availability of open data and the maturity of the supply-side, even if it is as mentioned a prerequisite. Impact, judging by the existing research, is certain to emerge, but the size and timing of such impact depends on a wide range of other factors on the demand-side as well, such as the skills and capabilities of stakeholders, time to market, location and timing. An idea for open data re-use that finds no traction in France because the initiators can’t bring it to fruition, or because the potential French demand is too low, may well find its way to success in Bulgaria or Spain, because local circumstances and markets differ. In the Serbian national open data readiness assessment that I performed for the World Bank and the UNDP in 2015 this is reflected in the various dimensions assessed, which cover both supply and demand, as well as general aspects of Serbian infrastructure and society.

7 We don’t understand how infrastructure creates impact

The notion of broad open data provision as public infrastructure (such as the UK, Netherlands, Denmark and Belgium are already doing, and Switzerland is starting to do) further underlines the difficulty of establishing the general impact of open data on e.g. growth. The point that infrastructure (such as roads, telecoms, electricity) is important to growth is broadly acknowledged, with the corresponding acceptance of that within policy making. This acceptance of quantity and quality of infrastructure increasing human and physical capital however does not mean that it is clear how much what type of infrastructure contributes at what time to economic production and growth. Public capital is often used as a proxy to ascertain the impact of infrastructure on growth. Consensus is that there is a positive elasticity, meaning that an increase in public capital results in an increase in GDP, averaging at around 0.08, but varying across studies and types of infrastructure. Assuming such positive elasticity extends to open data provision as infrastructure (and we have very good reasons to do so), it will result in GDP growth, but without a clear view overall as to how much.
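To make the elasticity figure concrete: an elasticity of 0.08 means that a 1% increase in public capital goes together with roughly a 0.08% increase in GDP. The sketch below just applies that definition as a linear approximation for small changes. The function name is illustrative, and whether the average of 0.08 carries over to open data provision is exactly the open question raised above.

```python
def gdp_growth_pct(public_capital_growth_pct: float, elasticity: float = 0.08) -> float:
    """Approximate GDP change (in %) for a given % change in public capital.

    Uses the definition of elasticity as a ratio of percentage changes,
    which is only a reasonable approximation for small changes.
    """
    return elasticity * public_capital_growth_pct


for capital_growth in (1.0, 5.0, 10.0):
    gdp_growth = gdp_growth_pct(capital_growth)
    print(f"{capital_growth:.0f}% more public capital -> about {gdp_growth:.2f}% more GDP")
```

The default of 0.08 is the cross-study average quoted above; since it varies across studies and types of infrastructure, any single output should be read as an order of magnitude, not a forecast.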

Most measurements concerning open data impact need to be understood as proxies. They are not measuring how open data is creating impact directly, but from measuring a certain movement it can be surmised that something is doing the moving. Where opening data can be assumed to be doing the moving, and where opening data was a deliberate effort to create such movement, impact can then be assessed. We may not be able to easily see it, but still it moves.

9 Motives often shape measurements

Apart from the difficulty of measuring impact and the effort involved in doing so, there is also the question of why such impact assessments are needed. Is an impact assessment needed to create support for ongoing open data efforts, or to make existing efforts sustainable? Is an impact measurement needed for comparison with specific costs for a specific data holder? Is it to be used for evaluation of open data policies in general? In other words, in whose perception should an impact measurement be meaningful?
The purpose of impact assessments for open data further determines and/or limits the way such assessments can be shaped.

10 Measurements get gamed, become targets

Finally, with any type of measurement, there needs to be awareness that those with a stake in a measurement are likely to try and game the system. Especially so where measurements determine funding for further projects, or the continuation of an effort. This must lead to caution when determining indicators. Measurements easily become a target in themselves. For instance in the early days of national open data portals being launched worldwide, a simple metric often reported was the number of datasets a portal contained. This is an example of a ‘point’ measurement that can be easily gamed, for instance by subdividing a dataset into several subsets. The first version of the national portal of a major EU member did precisely that and boasted several hundred thousand data sets at launch, which were mostly small subsets of a bigger whole. It briefly made for good headlines, but did not make for impact.

In a second part I will take a closer look at what these 10 points mean for designing a measurement framework to track open data impact.

I took part in a panel to discuss the opportunities of open data at regional level. The other panelists were my Serbian UNDP colleague Slobodan Markovic, Brigitte Lutz of the Vienna open data portal (whom I hadn’t met in years), Margreet Nieuwenhuis of the European open data portal, and Geert-Jan Waasdorp who uses open data about the European labour market commercially.

Below are the notes I used for my panel contributions:

Open data is a key building block for any policy plan. The Serbian government certainly treats it as such, judging by the PM’s message we just heard, and the same should be true for regional governments.

Open data from an organisational stand point is only sustainable if it is directly connected to primary policy processes, and not just an additional step or effort after the ‘real’ work has been done. It’s only sustainable if it means something for your own work as regional administration.

We know that open data allows people and organisations to take new actions. These by themselves or in aggregate have impact on policy domains. E.g. parents choosing schools for their children or finding housing, multimodal route planning, etc.

So if you know this effect exists, you can use it on purpose. Publish data to enable external stakeholders. You need to ask yourself: around which policy issues do you want to enable more activity? Which stakeholders do you want to enable or nudge? Which data will be helpful for that, if put into the hands of those stakeholders?

This makes open data a policy instrument. Next to funding and regulation, publishing open data for others to use is a way to influence stakeholder behaviour. By enabling them and partnering with them.
It is actually your cheapest policy instrument, as the cost of data collection is always a sunk cost as part of your public task.

Positioning open data this way, as a policy instrument, requires building connections between your policy issues, external stakeholders and their issues, and the data relevant in that context.

This requires going outside, listening to stakeholders, and understanding the issues they want to solve and the things they care about. You need to avoid making any assumptions.

We worked with various regional governments in the Netherlands, including the two Dutch AER members Flevoland and Gelderland. With them we learned that having those outside conversations is maybe the hardest part: creating conversations between a policy domain expert, an internal data expert, and the external stakeholders. There’s often a certain apprehension to reach out like that and have an open ended conversation on equal footing. From those conversations you learn different things. That your counterparts are also professionals interested in achieving results and using the available data responsibly. That the ways in which others have shaped their routines and processes are usually invisible to you, and may be surprising to you.
In Flevoland there’s a program for large scale maintenance on bridges and water locks in the coming 4 years. One of the provincial aims was to reduce hindrance. But an open question was what constitutes hindrance to different stakeholders. Only by talking to e.g. farmers it became clear that the maintenance plans themselves were less relevant than changes in those plans: a farmer rents equipment a week before some work needs to be done on the fields. If within that week a bridge unexpectedly becomes blocked, it means he can’t reach his fields with the rented equipment and damage is done. Also relevant is exploring which channels are useful to stakeholders for data dissemination. Finding channels that are used already by stakeholders or channels that connect to those is key. You can’t assume people will use whatever special channel you may think of building.

Whether it is about bridge maintenance, archeology, nitrate deposition, better usage of Interreg subsidies, or flash flooding after rain fall, talking about open data in terms of innovation and job creation is hollow and meaningless if it is not connected to one of those real issues. Only real issues motivate action.

Complex issues rarely have simple solutions. That is true for mobility, energy transition, demographic pressure on public services, emission reduction, and everything else regional governments are dealing with. None of this can be fixed by an administration on its own. So you benefit from enabling others to do their part. This includes local governments as a stakeholder group. Your own public sector data is one of the most easily available enablers in your arsenal.

Dutch Provinces publish open data, but it always looks like it is mostly geo-data, and hardly anything else. When talking to provinces I also get the feeling they struggle to think of data that isn’t of a geographic nature. That isn’t very surprising, a lot of the public tasks carried out by provinces have to do with spatial planning, nature and environment, and geographic data is a key tool for them. But now that we are aiding several provinces with extending their data provision, I wanted to find out in more detail.

My colleague Niene took the API of the Dutch national open data portal for a spin, and made a list of all datasets listed as stemming from a province.
I took that list and zoomed in on various aspects.
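The kind of tally this produces can be sketched as follows. This is not the script Niene used. It assumes the portal listing has already been exported to a list of records, and the field names `publisher` and `source` as well as the sample records are hypothetical placeholders, not the portal's actual schema or real data.

```python
from collections import Counter


def count_by_province(datasets):
    """Tally datasets per province, split into 'ngr' and 'other' publication channels."""
    counts = {}
    for record in datasets:
        province = record["publisher"]  # hypothetical field name
        channel = "ngr" if record["source"] == "ngr" else "other"
        counts.setdefault(province, Counter())[channel] += 1
    return counts


# Made-up placeholder records, only to show the shape of the input.
sample = [
    {"publisher": "Province A", "source": "ngr"},
    {"publisher": "Province A", "source": "ngr"},
    {"publisher": "Province A", "source": "regional-platform"},
    {"publisher": "Province B", "source": "ngr"},
]

for province, channels in sorted(count_by_province(sample).items()):
    print(province, dict(channels))
```

Splitting by channel at this stage is what makes it possible to separate geo-register publications from everything else in the later steps.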

At first glance there are strong differences between the provinces: some publish a lot, others hardly anything. The Province of Utrecht publishes everything twice to the national data portal, once through the national geo-register, once through their own dataplatform. The graph below has been corrected for it.

What explains those differences? And what is the nature of the published datasets?

Geo-data is dominant
First I made a distinction between data that stems from the national geo-register to which all provinces publish, and data that stems from another source (either regional dataplatforms, or for instance direct publication through the national open data portal). The NGR is theoretically the place where all provinces share geo-data with other government entities, part of which is then marked as publicly available. In practice the numbers suggest Provinces roughly publish to the NGR in the same proportions as the graph above (meaning that of what they publish in the NGR they mark about the same percentage as open data)

Of the over 3000 datasets that are published by provinces as open data in the national open data portal, only 48 don’t come from the national geo-register. This is about 1.5%.

Of the 12 provinces, 4 do not publish anything outside the NGR: Noord-Brabant, Zeeland, Flevoland, Overijssel.

Drenthe stands out in terms of numbers of geo-data sets published, over 900. A closer look at their list shows that they publish more historic data, and that they seem to be more complete (more of what they share in the NGR is apparently marked as open data). The average is between 200 and 300, with provinces like Zuid-Holland, Noord-Holland, Gelderland, Utrecht, Groningen, and Fryslan in that range. Overijssel, like Drenthe, publishes more, though at about 500 less than Drenthe. This seems to be the result of a direct connection to the NGR from their regional geo-portal, and thus publishing by default. Overijssel deliberately does not publish historic data, explaining some of the difference with Drenthe. (When something is updated in Overijssel the previous version is automatically removed. This clashes with open data good practice, but is currently hard to fix in their processes.)

If it isn’t geo, it hardly exists
Of the mere 48 data sets outside the NGR, just 22 (46%) are not geo-related. Overall this means that less than 1% of all open data provinces publish is not geo-data.
Of those 22, exactly half are published by Zuid-Holland alone. They publish, for instance, several photo-archives, a subsidy register, politicians’ expenses, and formal decisions.
Fryslan is the only province publishing an inventory of their data holdings, which is 1 of their only 3 non geo-data sets.
Gelderland stands out as the single province that publishes all their geo data through the NGR, hinting at a neatly organised process. Their non-NGR open data is also all non-geo (as it should be). They publish 27% of all open non-geo data by provinces, and together with Zuid-Holland account for 77% of it.
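The shares quoted above can be checked with a few lines of arithmetic. This is a minimal sketch: the counts 48 and 22 and the "exactly half" for Zuid-Holland come from the text, while the Gelderland count of 6 is an assumption inferred from the 27% figure, and 3000 is only the lower bound given for the total.

```python
# Counts taken from the text above.
non_ngr = 48                  # open datasets that do not come from the NGR
non_geo = 22                  # of those, the ones that are not geo-related
zuid_holland = non_geo // 2   # "exactly half"

# Assumptions, not stated in the text:
gelderland = 6                # inferred from the quoted 27% share
total_lower_bound = 3000      # the text only says "over 3000"


def pct(part, whole):
    """Share of part in whole, as a percentage rounded to a whole number."""
    return round(100 * part / whole)


print(pct(non_geo, non_ngr))                    # non-geo share of non-NGR data
print(pct(gelderland, non_geo))                 # Gelderland's share of non-geo data
print(pct(zuid_holland + gelderland, non_geo))  # Zuid-Holland and Gelderland combined
print(100 * non_geo / total_lower_bound < 1)    # non-geo is under 1% of everything
```

Because the real total is above 3000, the true non-geo share is even smaller than this lower-bound calculation suggests.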

Taking these numbers and comparing them to inventories like the one Fryslan publishes (which we made for them in 2016), and the one for Noord-Holland (which we did in 2013), the dominance of geo-data is not surprising in itself. Roughly 80% of data provinces hold is geo related. Just about a fifth to a quarter of this geo-data (15%-20% of the total) is on average published at the moment, yet it makes up over 99% of all provincial open data published. This lopsidedness means that hardly anything on the inner workings of a province, the effectiveness of policy implementation etc. is available as open data.

Where the opportunities are
To improve both on the volume and on the breadth of scope of the data provinces publish, two courses of action stand open.
First, extending the availability of geo-data provinces hold. Most provinces will have a clear process for this, so it should be relatively easy to do, and it should be possible for most provinces to get to where Drenthe currently is.
Second, take a much closer look at the in-house data that is not geo-related. About 20% of data holdings fall in this category, and based on the inventories we did, some 90% of that should be publishable, maybe after some aggregation or other adaptations.
The lack of an inventory is an obstacle here, but existing inventories should at least be able to point the other provinces in the right direction.

Make the provision of provincial open geodata complete, embrace its dominance and automate it with proper data governance. Focus your energy on publishing ‘the rest’ where all the data on the inner workings of the province is. Provinces perpetually complain nobody is aware of what they are doing and their role in Dutch governance. Make it visible, publish your data. Stop making yourself invisible behind a stack of maps only.

Good and frank conversation today with someone at the European Parliament working on the EP’s planned response to the European Commission’s new proposal for the PSI Directive. Will put more thoughts to paper and publish early August.

We’re currently helping the Province of South-Holland to extend their open data provision. Next to looking at data they hold that is relevant to key policy domains, we also look at what other data is available elsewhere for those domains, for instance nationwide datasets with a locally granular level of detail. In those cases it can be of interest to take the subset relevant for the Province and republish that through their own channels.

One of the relevant topics is energy transition (to sustainable energy sources). Current and historic household usage is of interest here. The companies that maintain the grid publish yearly data per postcode, or at least some of them do. There are seven of these companies.
Luckily all three companies active in South-Holland do publish that data.

Having this subset of data is useful for any organisation in the region that wants to limit the amount of data they have to dig through to get what they need, for the provincial organisation itself, and for individual citizens. Households that have digital meters have access to their daily energy usage readings online. This data allows them to easily compare their personal usage with their neighbours and wider surrounding area. For instance I established that our usage is lower for both electricity and gas than average in our street. It is also easier to map, or otherwise visualise, in a meaningful way for the province and relevant regional stakeholders.

Here’s a brief overview of the steps we’re taking to get to a province-wide data set.

Download the data for the years available for Westland, Liander and Stedin (Westland goes back to 2010, the others to 2008)

Check the data formats: Westland and Stedin provide CSV, Liander XLSX

Check data structure: all use the same structure of fields and conventions

To get only the data for South-Holland we use the postcode that is mentioned in the data.

The Dutch postcode zones do not conform to provincial boundaries however, so we take the list of four position postcodes and determine the ones that fall within South-Holland:

1428-1429

2159-2164

2170-3381

3465-3466

4126-4129

4140-4146

4163-4169

4200-4209

4213

4220-4249

The data contains 6 position postcodes of the structure 1234AB. We need to split them into the four digits and the two letters, to be able to match them with the ranges that fall within the province.

To protect personal data, any 6 position postcode with fewer than 10 addresses is aggregated in the data with a neighbouring postcode, until the number of addresses is higher than 9. It is not certain that those aggregations fall within a single province. The data provides a ‘from’ 6 position postcode and a ‘to’ 6 position postcode. These hold the same value where the number of addresses in a postcode is high enough, but can span a wider range.

We need to test if the entire postcode range in a single data record falls within one of the ranges of postcodes that belong in South-Holland.

For the small number of aggregates that fall into two provinces we can adopt the average usage number, but need to mark that the number of households in that area is unknown,

or retrieve the actual number of addresses from the national address and building database, and mark that the average energy usage values are from a larger number of addresses.

Alternatively we can keep the entire range, including the part outside the province,

or we exclude the entire range and leave a ‘hole in the map’.

In any case we need to mark in the data what we did, and why.

The result is then a data set in CSV that consolidates the three sources for all those records that fall within the province.

This dataset can then be mapped, e.g. in QGIS or other tools in use within the province of South-Holland.

We provide a recipe and/or script from the above steps that can take the future yearly data sets from the three sources and turn them into a consolidated subset for South-Holland, so that the province can automate keeping the data up to date.
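The postcode matching in the steps above can be sketched in Python roughly as follows. This is only a sketch under assumptions: the function names are hypothetical, the ‘from’ and ‘to’ postcodes are compared on their four digits only, and a record whose numeric span is not fully covered by one of the listed ranges is flagged as "partial" for one of the treatments described above, rather than resolved automatically.

```python
import re

# Four-digit postcode ranges listed above as falling within South-Holland (inclusive).
SOUTH_HOLLAND_RANGES = [
    (1428, 1429), (2159, 2164), (2170, 3381), (3465, 3466), (4126, 4129),
    (4140, 4146), (4163, 4169), (4200, 4209), (4213, 4213), (4220, 4249),
]

# Matches '1234AB' as well as '1234 AB'.
POSTCODE = re.compile(r"^(\d{4})\s?([A-Za-z]{2})$")


def split_postcode(postcode):
    """Split a 6 position postcode into its four digits and two letters."""
    match = POSTCODE.match(postcode.strip())
    if match is None:
        raise ValueError(f"not a 6 position postcode: {postcode!r}")
    return int(match.group(1)), match.group(2).upper()


def classify(postcode_from, postcode_to):
    """Return 'inside', 'partial' or 'outside' for one record's postcode range."""
    low, _ = split_postcode(postcode_from)
    high, _ = split_postcode(postcode_to)
    if low > high:
        low, high = high, low
    # Entirely inside: the whole span fits in a single provincial range.
    if any(start <= low and high <= end for start, end in SOUTH_HOLLAND_RANGES):
        return "inside"
    # Partly inside: the span overlaps at least one provincial range.
    if any(low <= end and start <= high for start, end in SOUTH_HOLLAND_RANGES):
        return "partial"
    return "outside"


print(split_postcode("2595AA"))
print(classify("2595AA", "2595AB"))
print(classify("3381ZZ", "3382AA"))
```

Writing the classification into the consolidated CSV as an extra column is one way to mark in the data what was done with each record, and why.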

Today I contributed to a session of the open data research groups at Delft University. They do this a few times per year to discuss ongoing research and explore emerging questions that can lead to new research. I’ve taken part a few times in the past, and this time they asked me to provide an overview of what I see as current developments.

Some of the things I touched upon are similar to the remarks I made in Serbia during Open Data Week in Belgrade. The new PSI Directive proposal was also on the menu. I ended with the questions I think deserve attention. They are either about how to make sure that abstract norms get translated to the very practical, and to the local level inside government, or about how to ensure that critical elements get connected and visibly stay that way (such as links between regular policy goals / teams and information management).

The slides are on SlideShare (id 102667069, doc tudopenquestions-180619173722).

In the second part of our afternoon, Iryna Susha and Bastiaan van Loenen took us through their research into the data protection steps that are in play in data collaboratives. This I found very worthwhile, as data governance issues of collaborative groups (e.g. public and private entities around energy transition) regularly surface in my work. They do so both where collaboration threatens data sovereignty, for instance, and where collaboratively pooled data can hardly be shared because it has become impossible to navigate the contractual obligations connected to the data that was pooled.

TL;DR

The European Commission proposed a new PSI Directive, which describes when and how publicly held data can be re-used by anyone (aka open government data). The proposal contains several highly interesting elements: it extends the scope to public undertakings (utilities and transport mostly) and research data, it limits the ways in which government can charge for data, introduces a high value data list which must be freely and openly available, mandates APIs, and makes de-facto exclusive arrangements transparent. It also calls for delegated powers for the EC to change practical details of the Directive in future, which opens interesting possibilities. In the coming months (years) it remains to be seen what the Member States and the European Parliament will do to weaken or strengthen this proposal.

Changes in the PSI Directive announced

On 25 April, the European Commission announced new measures to stimulate the European data economy, said to be building on the GDPR, as well as detailing the European framework for the free flow of non-personal data. The EC announced new guidelines for the sharing of scientific data, and for how businesses exchange data. It announced an action plan that increases safeguards on personal data related to health care and seeks to stimulate European cooperation on using this data. The EC also proposes to change the PSI Directive, which governs the re-use of public sector information, commonly known as Open Government Data. In previous months the PSI Directive was evaluated (see an evaluation report here, in which my colleague Marc and I were involved).

This post takes a closer look at what the EC proposes for the PSI Directive. (I did the same thing when the last version was published in 2013.)
This is of course a first proposal from the EC, and it may change significantly as a result of discussions with Member States and the European Parliament, before it is finalised and enters into law. Taking a look at the proposed new directive is of interest to see what’s new, what is missing from an open data perspective, and where debate with Member States is most likely. Square bullets indicate the more interesting changes.

The Open Data yardstick

The original PSI Directive was adopted in 2003 and a revised version implemented in 2015. Where the original PSI Directive stems from well before the emergence of the Open Data movement, and was written with mostly ‘traditional’ and existing re-users of government information in mind, the 2015 revision already adopted some elements bringing it closer to the Open Definition. With this new proposal, the yardstick is again how far it increases openness and sets minimum requirements that align with the Open Definition, and how much of it will be mandatory for Member States. So, scope and access rights, redress, charging and licensing, standards and formats are important. There are also some general context elements that stand out from the proposal.

A floor for the data-based society

In the recital for the proposal what jumps out is a small change in wording concerning the necessity of the PSI Directive. Where it used to say “information and knowledge” it now says “the evolution towards a data-based society influences the life of every citizen”. Towards the end of the proposal it describes the Directive as a means to improve the proper functioning of the European data economy, where it used to read ‘content industry’. The proposed directive lists minimum requirements for governments to provide data in ways that enable citizens and economic activity, but suggests Member States can and should do more, and not just stick with the floor this proposal puts in place.

There are a few novel elements spread out through the proposal that are of interest, because they seem intended to make the PSI Directive more flexible with an eye to the future.

The EC proposal adds the ability to create delegated acts. This would allow practical changes without the need to revise the PSI Directive and have it transposed into national law by each Member State. While this delegated power cannot be used to change the principles in the directive, it can be used to tweak it. Concerning charging, scope, licenses and formats this would provide the EC with more elbow room than the existing ability to merely provide guidance. The article is added to be able to maintain a list of ‘high value data sets’, see below.

Public undertakings are defined and mentioned in parallel to public sector bodies in each provision. Public undertakings are all those that are (in)directly owned by government bodies, significantly financed by them or controlled by them through regulation or decision making powers. It used to say only public sector, basically allowing governments to withdraw data from the scope of the Directive by putting it at a distance in a private entity under government control. While the scope is enlarged to include public undertakings in specific sectors only, the rest of the proposal refers to public undertakings in general. This is significant I think, given the delegated powers the EC also seeks.

Dynamic and real-time data is brought firmly in scope of the Directive. There have been court cases where data provision was refused on the grounds that the data did not exist when the request was made. That will no longer be possible with this proposal.

The EC wants to make a list of ‘high value datasets’ for which more things are mandatory (machine readable, API, free of charge, open standard license). It will create the list through the mentioned delegated powers. In my experience deciding on high value data sets is problematic (What value, how high? To whom?) and reinforces a supply-side perspective over a demand-driven approach. The Commission defines high value as “being associated with important socio-economic benefits” due to their suitability for creating services, and “the number of potential beneficiaries” of those services based on these data sets.

Access rights and scope

Public undertakings in specific sectors are declared within scope. These sectors are water, gas/heat, electricity, ports and airports, postal services, water transport and air transport. These public undertakings are only within scope in the sense that requests for re-use can be submitted to them. They are under no obligation to release data.

Research data from publicly funded research that are already made available e.g. through institution repositories are within scope. Member States shall adopt national policies to make more research data available.

A previous scope extension (museums, archives, libraries and university libraries) is maintained. For educational institutions a clarification is added that it only concerns tertiary education.

The proposed directive builds as before on existing access regimes, and only deals with the re-use of accessible data. This maintains existing differences between Member States concerning right to information.

Public sector bodies, although they retain any database rights they may have, cannot use those database rights to prevent or limit re-use.

Asking for documents to re-use, and redress mechanisms if denied

The way in which citizens can ask for data, or the way government bodies can respond, has not changed.

The redress mechanisms haven’t changed, and public undertakings, educational institutes, research organisations and research funding organisations do not need to provide one.

Charging practices

The proposal now explicitly mentions free of charge data provision as the first option. Fees are otherwise limited to at most ‘marginal costs’.

The marginal costs are redefined to include the costs of anonymizing data and protecting commercially confidential material. The full definition now reads “marginal costs incurred for their reproduction, provision and dissemination and where applicable anonymisation of personal data and measures to protect commercially confidential information.” While this likely helps in making more data available, in contrast to a blanket refusal, it also looks like externalising onto the re-user the costs of what is essentially badly implemented internal data governance. Data holders already should be able to do this quickly and effectively for internal reporting and democratic control. Marginal costing is an important principle, as in the case of digital material it would normally mean no charges apply, but this addition seems to open up the definition to much wider interpretation.

The ‘marginal costs at most’ principle only applies to the public sector. Public undertakings and museums, archives etc. are excepted.

As before public sector bodies that are required (by law) to generate revenue to cover the costs of their public task performance are excepted from the marginal costs principle. However a previous exception for other public sector bodies having requirements to charge for the re-use of specific documents is deleted.

The total revenue from allowed charges may not exceed the total actual cost of producing and disseminating the data plus a reasonable return on investment. This is unchanged, but the ‘reasonable return on investment’ is now defined as at most 5 percentage points above the ECB fixed interest rate.

Re-use of research data and the high value data-sets must be free of charge. In practice various data sets that are currently charged for are also likely high value datasets (cadastral records, business registers for instance). Here the views of Member States are most likely to clash with those of the EC.
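To get a feel for the revenue ceiling described in this section, here is a small sketch. The proposal as summarised here only states the cap of 5 percentage points above the ECB fixed interest rate; applying that rate to the total cost figure is my simplifying assumption for illustration, and the function names and example amounts are hypothetical.

```python
def max_return_rate(ecb_fixed_rate_pct):
    """Upper bound on the 'reasonable return on investment', in percent."""
    return ecb_fixed_rate_pct + 5.0


def max_total_revenue(total_cost, ecb_fixed_rate_pct):
    """Revenue ceiling: total cost plus the capped return.

    Assumption for illustration only: the return is calculated over the
    total cost of producing and disseminating the data.
    """
    return total_cost + total_cost * max_return_rate(ecb_fixed_rate_pct) / 100


# Hypothetical example: 1 million in costs and an ECB fixed rate of 0%.
print(max_return_rate(0.0))
print(max_total_revenue(1_000_000, 0.0))
```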

Licensing

The proposal contains no explicit move towards open licenses, and retains the existing rules that standard licenses should be available, and that those should not unnecessarily restrict re-use, nor restrict competition. The only addition is that Member States shall encourage not only public sector bodies but all data holders to use such standard licenses.

High value data sets must have a license compatible with open standard licenses.

Non-discrimination and Exclusive agreements

Non-discrimination rules in how conditions for re-use are applied, including for commercial activities by the public sector itself, are continued.

Exclusive arrangements are not allowed for public undertakings, as before for the public sector, with the same existing exceptions.

Where new exclusive rights are granted the arrangements now need to be made public at least two months before coming into force, and the final terms of the arrangement need to be transparent and public as well.

Importantly, any agreement or practical arrangement with third parties that in practice results in restricted availability for re-use of data other than for those third parties must also be published two months in advance, and the final terms also made transparent and public. This concerns data sharing agreements and other collaborations where a few third parties have de facto exclusive access to data. With all the developments around smart cities where companies e.g. have access to sensor data others don’t, this is a very welcome step.

Formats and standards

Public undertakings will need to adhere to the same rules as the public sector already does: open standards and machine readable formats should be used for both documents and their metadata, where easily possible, but otherwise any pre-existing format and language is acceptable.

Both public sector bodies and public undertakings should provide APIs to dynamic data, either in real time, or if that is too costly within a timeframe that does not unduly impair the re-use potential.

High value data sets must be machine readable and available through an API.

Let’s see how the EC takes this proposal forward, and what the reactions of the Member States and the European Parliament will be.

The US government is looking at whether to start asking money again for providing satellite imagery and data from Landsat satellites, according to an article in Nature.

Officials at the Department of the Interior, which oversees the USGS, have asked a federal advisory committee to explore how putting a price on Landsat data might affect scientists and other users; the panel’s analysis is due later this year. And the USDA is contemplating a plan to institute fees for its data as early as 2019.

To “explore how putting a price on Landsat data might affect” the users of the data, will result in predictable answers, I feel.

Public digital government held data, such as Landsat imagery, is both non-rivalrous and non-excludable.

The initial production costs of such data may be very high, and surely are in the case of satellite data as it involves space launches. Yet these costs are made in the execution of a public and mandated task, and as such are sunk costs. These costs are not made so others can re-use the data, but made anyway for an internal task (such as national security in this case).

The copying and distribution costs of additional copies of such digital data are marginal, tending to zero.

Government held data usually, and certainly in the case of satellite data, constitute a (near) monopoly, with no easily available alternatives. As a consequence price elasticity is above 1: when the price of such data is reduced, the demand for it will rise non-linearly. The inverse is also true: setting a price for government data that currently is free will not mean all current users will pay, it will mean a disproportionate part of current usage will simply evaporate, and the usage will be much less both in terms of numbers of users as well as of volume of usage per user.

Data sales from one public entity to another publicly funded one, such as in this case academic institutions, are always a net loss to the public sector, due to administration costs, transaction costs and enforcement costs. It moves money from one pocket to another of the same outfit, but that transfer costs money itself.

The (socio-economic) value of re-use of such data is always higher than the possible revenue of selling that data. That value will also accrue to the public sector in the form of additional tax revenue. Loss of revenue from data sales will always over time become smaller than that. Free provision or at most at marginal costs (the true incremental cost of providing the data to one single additional user) is economically the only logical path.

Additionally the value of data re-use is not limited to the first order of re-use (in this case e.g. the academic research it enables), but has “downstream” higher order and network effects, e.g. the value that such academic research results create in society, in this case for instance in agriculture, public health and climatic impact mitigation. “Upstream” value is also derived from re-use, e.g. in the form of data quality improvement.

This is precisely why the data was made free in 2008 in the first place:

Since the USGS made the data freely available, the rate at which users download it has jumped 100-fold. The images have enabled groundbreaking studies of changes in forests, surface water, and cities, among other topics. Searching Google Scholar for “Landsat” turns up nearly 100,000 papers published since 2008.

That 100-fold jump in usage? That’s the price elasticity being higher than 1, I mentioned. It is a regularly occurring pattern where fees for data are dropped, whether it concerns statistics, meteo, hydrological, cadastral, business register or indeed satellite data.

The economic benefit of the free Landsat data was estimated by the USGS in 2013 at $2 billion per year, while the programme costs about $80 million per year. That’s an ROI factor of 25 for the US government. Even if the total combined tax burden (payroll, sales/VAT, income, profit, dividend etc.) on that economic benefit were as low as 4%, it would still mean no loss to the US government.
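The arithmetic behind those two numbers, as a quick check. The benefit and cost figures are the ones quoted above; the variable names are just for illustration.

```python
benefit_per_year = 2_000_000_000  # USGS 2013 estimate of economic benefit, in USD
cost_per_year = 80_000_000        # approximate yearly programme cost, in USD

# How many dollars of economic benefit each programme dollar generates.
roi_factor = benefit_per_year / cost_per_year

# Combined tax rate on the benefit at which tax revenue equals programme cost.
break_even_tax_rate = cost_per_year / benefit_per_year

print(roi_factor)           # 25.0
print(break_even_tax_rate)  # 0.04, i.e. 4%
```

Any combined tax burden above that 4% break-even rate means the programme returns more to the treasury than it costs.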

It’s not surprising then that when a committee was asked in 2012 to look into reinstating fees for Landsat data, it concluded:

“Landsat benefits far outweigh the cost”. Charging money for the satellite data would waste money, stifle science and innovation, and hamper the government’s ability to monitor national security, the panel added. “It is in the U.S. national interest to fund and distribute Landsat data to the public without cost now and in the future,”

What kind of energy consumption data do you have at a postal code level in NL? Are your energy utilities public bodies?
Our electricity provider, and our oil and propane companies are all private, and do not release consumption data; our water utility is public, but doesn’t release consumption data and is not subject (yet) to freedom of information laws.

Let’s provide some answers.

Postal codes

Dutch postal codes have the structure ‘1234 AB’, where 12 denotes a region, 1234 denotes a village or neighbourhood, and AB a street or a section of a street. This makes them very useful as geographic references in working with data. Our postal code begins with 3825, which places it in the Vathorst neighbourhood, as shown on this list. In the image below you see the postal code 3825 demarcated on Google maps.

Postal codes are available both commercially and as open data. The commercially available version is a full set. Available as open data are only those postal codes that are connected to addresses tied to physical buildings. This is because the base register of all buildings and addresses is open data in the Netherlands, and that register includes postal codes. It means that e.g. postal codes tied to P.O. Boxes are not available as open data. In practice getting at postal codes as open data is still hard, as you need to extract them from the base register, and finding that base register for download is actually hard (or at least used to be, I haven’t checked back recently).

On Energy Utilities

All energy utilities used to be publicly owned, but have since been privatised. Upon privatisation all utilities were separated into energy providers and energy transporters, called network maintainers. The network maintainers are private entities, but are publicly owned. They maintain both electricity mains as well as gas mains. There are 7 such network maintainers of varying sizes in the Netherlands.

The three biggest are Liander, Enexis and Stedin.
These network maintainers, although publicly owned, are not subject to Freedom of Information requests, nor subject to the law on Re-use of Government Information. Yet they do publish open data, and are open to data requests. Liander was the first one, and Enexis and Stedin both followed. The motivation for this is that they have a key role in the government goal of achieving full energy transition by 2050 (meaning no usage of gas for heating/cooking and fully CO2 neutral), and that they are key stakeholders in this area of high public interest.

Household Energy Usage Data

Open data is published by Liander, Enexis and Stedin, though not all publish the same type of data. All publish household level energy usage data aggregated to the level of 6 position postal codes (1234 AB), in addition to asset data (including sub soil cables etc) by Enexis and Stedin. The service areas of all 7 network maintainers are also open data. The network maintainers are also all open to additional data requests, e.g. for research purposes or for municipalities or housing associations looking for data to plan for energy saving projects. Liander indicated to me in a review for the European Commission (about potential changes to the EU public data re-use regulations), that they currently deny about 2/3 of data requests received, mostly because they are uncertain about which rules and contracts apply (they hold a large pool of data contributed by various stakeholders in the field, as well as all remotely read digital metering data). They are investigating how to improve on that response rate.

Some postal code areas are small and contain only a few addresses. In such cases this may lead to personally identifiable data, which is not allowed. Liander, Stedin, and I assume Enexis as well, solve this by aggregating the average energy usage of the small area with an adjacent area until the number of addresses is at least 10.
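I don’t know the exact procedure the network maintainers use, but the idea can be sketched as follows. Assumptions in this sketch: ‘adjacent’ is approximated by the next postal code in sorted order, a leftover group that is still too small at the end is merged into the group before it, and the function name and example numbers are hypothetical.

```python
def aggregate_small_areas(records, minimum=10):
    """Merge consecutive postal codes until each group has enough addresses.

    records: list of (postal_code, addresses, total_usage), sorted by postal code.
    Returns a list of (code_from, code_to, addresses, average_usage).
    """
    groups = []
    current = None
    for code, addresses, usage in records:
        if current is None:
            current = [code, code, addresses, usage]
        else:
            current[1] = code
            current[2] += addresses
            current[3] += usage
        if current[2] >= minimum:
            groups.append(current)
            current = None
    if current is not None:
        if groups:
            # Leftover at the end: fold it into the previous group.
            last = groups[-1]
            last[1] = current[1]
            last[2] += current[2]
            last[3] += current[3]
        else:
            groups.append(current)
    return [(start, end, n, total / n if n else 0.0)
            for start, end, n, total in groups]


# Hypothetical example data: (postal code, addresses, total yearly usage in kWh).
example = [("1234AA", 12, 36000), ("1234AB", 4, 10000),
           ("1234AC", 7, 23000), ("1234AD", 3, 9000)]
for row in aggregate_small_areas(example):
    print(row)
```

In this example the first postal code is large enough on its own, while the other three end up in one group of 14 addresses running from 1234AB to 1234AD.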

Our address falls in the service area of Stedin. The most recent data is that of January 1st 2018, containing the energy use for all of 2017. Searching for our postal code (which covers the entire street) in their most recent CSV file yields two records, on lines 151,624 and 151,625:


The first line shows electricity usage (ELK), and says there are 33 households in the street, and the average yearly usage is 4599 kWh. (We are below that at around 3700 kWh / year, which is higher than we were used to in our previous home). The next line provides the data for gas usage (heating and cooking) “GAS”, which is 1280 m3 on average for the 33 connections. (We are slightly below that at 1200 m3).

At the edge of our neighbourhood, on a section of grassland, there are plans to create a solar farm. This is a temporary set-up as the land will eventually be used to build houses. Those living in the houses overlooking those fields started a petition as they fear it diminishes their view. There’s a whiff of nimby here, but it’s also justified resistance as it flies in the face of an earlier two year long participatory project by the city to determine with those who live here how to use those fields.

The petition I think didn’t gather a lot of signatures (just over 1100 now). I somewhat tongue in cheek asked the initiators online if there was also a petition I could sign in favour of the solar fields. The Netherlands after all is running far behind its own goals concerning renewables so I feel action on a wider scale is needed.

This led to forming a small group of people looking into what can be done towards more solar using existing roofs in our neighbourhood. A constructive outcome I think, even if I have little real time to contribute. In conversation with the group I offered to look into what data might be helpful, to determine both the actual potential of solar energy in our location (how much irradiance hits the surface here, and what yield does that make possible), and the latent potential (based on the current energy usage at household level in our part of town).

Data on irradiance is available. As is household electricity usage on postcode level, which means more or less to block level. What I haven’t really looked at is whether there is open data concerning roof space. The base register for buildings and addresses (BAG) contains the shapes of buildings for every building in the Netherlands, but that is only in 2D, so it doesn’t provide the shape of non-flat roofs. Getting the roof shapes would require combining the BAG with AHN, the lidar scan of the Netherlands that contains all heights (trees, buildings and whatnot). The AHN however is created as snapshots. Our area is actively being developed, and houses are continuously being added. The latest AHN scan of our area was in 2010, so is heavily outdated. Luckily the new AHN3 (the 3rd AHN) scans for this region are scheduled for this year, and will be made available as open data. So at least we’ll have recent data to work with.

I intend to play around with this data to see if something can be said about potential and latent demand for solar energy in our area.

About

Blog Interdependent Thoughts maintained since 2002 by Ton Zijlstra. European citizen in a networked world. Based in the Netherlands, living in Europe, working globally. There are no Others. There is just me and many of you.

I write about how our digital and networked world changes how we work, learn, decide and organize. I explore the tools and strategies that help us navigate the networked world.
I am passionate about increasing people's ability to act (knowledge), and their ability to change (learning). Key-words: open data, open government, fablabs, making, complexity, networked agency, networked learning, ethics by design.