As Friedrich Lindenberg was writing this abstruse code on his MacBook, plugged into the projector at the workshop on EU spending on 9 September, 20 journalists listened attentively as data started to come alive before their eyes. In a conference room in Utrecht University’s 15th-century Faculty Club, the group from across Europe watched as Lindenberg compared a list of lobbying firms with the list of accredited experts at the European Commission: any overlap would point to a possible conflict of interest.

Rather than just watching, the audience followed Lindenberg’s steps in Google Refine, an Excel-like tool, and tamed the data on their own laptops. At that moment, more journalists were engaged in data mining in Utrecht than in any newsroom. This practical exercise was the climax of two days of learning to investigate the mountains of data produced by European institutions. Besides Lindenberg (the coder behind Open Spending), EU data journalist Caelainn Barr, OpenCorporates founder Chris Taggart and Erik Wesselius of Corporate Europe shared their expertise with participants.

The EU budget has the advantage of being massive (€120 billion) and fairly open, compared to what a journalist can get from most national governments. It was the perfect topic for the European Journalism Centre and the Open Knowledge Foundation to bring together open data and data journalism. It was also a perfect topic for participants, whose ideas, depicted on the mind-map above, ranged in the first brainstorming session from health issues to real-time data to the always-fascinating lobbyists and regional grants.

Good reporting needed

Journalists reporting on the EU budget face an uphill struggle. Knowledge of the budget among Europeans is abysmally poor. One in three Europeans has never heard of an EU budget, and fewer than one in four of those who have know that most of the budget is spent on agriculture. More interestingly, the graph shows that the level of ignorance has remained fairly constant over the past 10 years, with a solidly anchored belief that administrative costs represent the lion’s share of the EU budget (the actual figure is 6%).

A lack of access to clear and clean data might be one of the reasons why perceptions of the EU budget are so far off the mark. Ron Korver, press officer at the EU Parliament, opened the workshop by explaining that EU institutions are sometimes reluctant to give a clear picture of their finances. He himself had to dig through PDFs published by the Commission to find a comprehensive view of the 2009 expenditures by country. Worse still: as of this writing, the brochure ‘Budget 2011: Beyond the crisis, towards new goals’ still redirects to a “page not found” 404 error.

The workshop provided a broad overview of the resources available for mining EU-related data, listed on this wiki. Participants were thrilled to see that expenditures could be tracked at the project level, sometimes involving only a few thousand euros (that’s on the Cohesion policy website). Most had no idea that a public register of lobbyists existed (the transparency register).

Data was analyzed using Google Refine, powerful spreadsheet software that can be linked to online services. Taggart demonstrated how a journalist could seamlessly pull data from the international company directory OpenCorporates directly into Google Refine using its reconciliation service. The rationale behind these efforts was that proficiency with such tools helps journalists save time and investigate more efficiently.
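
The kind of list comparison done at the workshop (lobbying firms against accredited experts) can be roughed out in a few lines of Python. This is only a minimal sketch, not the Refine or OpenCorporates reconciliation service: the sample names, the list of legal suffixes and the functions `normalize` and `overlap` are all invented for illustration, and it assumes that two names refer to the same entity when they are identical after light cleaning.

```python
import re

# Assumption: legal-form suffixes that can be ignored when comparing names.
LEGAL_SUFFIXES = {"ltd", "limited", "gmbh", "sa", "bv", "inc", "llc", "plc"}


def normalize(name):
    """Lowercase a name, strip punctuation and drop common legal suffixes."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def overlap(list_a, list_b):
    """Return (name_a, name_b) pairs whose normalized forms are identical."""
    index = {}
    for name in list_b:
        index.setdefault(normalize(name), []).append(name)
    return [(a, b) for a in list_a for b in index.get(normalize(a), [])]


# Made-up sample data, for illustration only.
lobbying_firms = ["Acme Consulting Ltd.", "Example Affairs SA", "Foo & Bar GmbH"]
expert_affiliations = ["ACME Consulting", "Foo and Bar", "Sample Institute"]

matches = overlap(lobbying_firms, expert_affiliations)
print(matches)
```

Exact matching on cleaned names is deliberately crude: here "Foo & Bar GmbH" and "Foo and Bar" are not paired, which is the sort of near-miss that clustering or a reconciliation service is meant to catch. A match is a lead to verify, not proof.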

Getting the data into a structured format using scrapers or character recognition software is only the first step. Next, Barr explained, journalists can look for elements that contradict the rules (e.g. subsidies given to arms or tobacco companies) or dig around companies known for their mafia or crime connections. Another approach is hypothesis-based. Strange voting patterns around a piece of local legislation might be linked to conflicts of interest, for instance.
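
Once the data is structured, the rule-based check Barr described boils down to a filter. The sketch below assumes, following the example above, that payments to arms or tobacco companies break the rules; the records, field names and the `flag_rule_breakers` function are hypothetical and do not reflect any real EU dataset.

```python
# Hypothetical payment records; every name and amount here is invented.
payments = [
    {"beneficiary": "Alpha Farms", "sector": "agriculture", "amount_eur": 12_000},
    {"beneficiary": "Beta Arms", "sector": "arms", "amount_eur": 250_000},
    {"beneficiary": "Gamma Tobacco", "sector": "tobacco", "amount_eur": 80_000},
]

# Assumption: sectors that should not receive subsidies under the rules.
EXCLUDED_SECTORS = {"arms", "tobacco"}


def flag_rule_breakers(rows, excluded=EXCLUDED_SECTORS):
    """Return rows in an excluded sector, largest payments first."""
    flagged = [row for row in rows if row["sector"] in excluded]
    return sorted(flagged, key=lambda row: row["amount_eur"], reverse=True)


flagged = flag_rule_breakers(payments)
for row in flagged:
    print(row["beneficiary"], row["amount_eur"])
```

In practice the hard part is the sector label itself, which usually has to be pulled from a company register rather than from the spending data.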

The EU expenditures database can be mashed up with other sources, such as national registers, from which additional information can be pulled. Slovak website Znasichdani, for instance, monitors companies that were awarded public tenders. Switzerland’s Infocube released an application that shows the companies national MPs have a stake in. Each of these initiatives provides material for civic-minded and highly compelling journalism.

Databases, which are often not visible in Google’s index, offer factual bits of information that can prove crucial in some investigations. Knowing that a company folded months after it received government funding, for instance, hints at possible wrongdoing. Relying on hard data in addition to the usual quotes from unnamed sources is the basis for precision journalism (what WikiLeaks’ Julian Assange referred to as scientific journalism), a way of working that provides more robust results than traditional methods.

The juice is at the national level

Despite these efforts to dig up stories, the EU budget is likely among the cleanest in Europe. The Santer Commission, for instance, resigned in 1999 over a fraud scandal where the key charge was a dubious hire by Commissioner Edith Cresson. She took on a close friend and had him paid for two years, to the tune of €50,000 a year, to produce a 24-page report. Outrageous, certainly. But the sum represents less than a minute’s worth of government corruption in Italy, which reaches up to €60 billion a year (no one resigned).
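
The comparison holds up as back-of-the-envelope arithmetic. The sketch below takes the figures quoted above and assumes, for simplicity, that the upper estimate of €60 billion is spread evenly over a 365-day year; the variable names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

# Figures quoted in the text: two years at EUR 50,000 a year.
cresson_case_eur = 2 * 50_000
# Upper estimate quoted in the text for Italy, per year.
italy_corruption_per_year_eur = 60e9

# Assumption: the yearly amount is spread evenly over time.
per_minute = italy_corruption_per_year_eur / MINUTES_PER_YEAR
seconds_equivalent = cresson_case_eur / per_minute * 60

print(f"Per minute: about EUR {per_minute:,.0f}")
print(f"Cresson case: about {seconds_equivalent:.0f} seconds' worth")
```

Under these assumptions one minute corresponds to roughly €114,000, so the €100,000 at stake in the Cresson case is a little under a minute’s worth.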

The focus on EU money should not lead European journalists to neglect national and local affairs. The openness of EU institutions (relative to most others in continental Europe) should not work against them but should be used by journalists as a launching pad to investigate bigger organizations. After all, the EU budget represents only about 2% of total government expenditure in Europe.

Participants took this path. Brussels-based investigative journalist Mehmet Koksal, for instance, set out during the workshop to scrape the public journal of the Belgian state to mine the relations between public officials and their private-sector activities.

But such initiatives will be hard to implement without more robust coding skills. The workshop showed that there is a profound need for all kinds of skills, from data scraping to statistical analysis to data visualization. Training will be needed in these areas for journalists to become really proficient with data.

What’s more, the question of the value of data-driven reporting is still swept under the carpet. Barr’s investigation into Structural funds took 8 months. A rough estimate would put the price tag of such an enterprise above €50,000. Not many newsrooms can be convinced to put that kind of money on the table, and fewer journalists still would be able to commit it on their own. Once the value of a data-based investigation is understood, getting funding will be easier. Assessing the profitability of a data-driven approach must be the next step for the #ddj community.

About the author(s)

Nicolas Kayser-Bril

Data journalist

Nicolas Kayser-Bril (@nicolaskb) is a Berlin- and Paris-based nerd. For a living, he tells stories using data. He crunches, grinds, chews and squeezes numbers to extract meaning out of them. That is called data-driven journalism and he was one of the first to practice it in Europe.