The Panama Papers: the tech that helped the International Consortium of Investigative Journalists unearth the biggest leak in history

The data unit head of the ICIJ, winners of 'Innovative Team of the Year' at the techies, explains the tech behind the story

Tom is online editor. He studied English Literature and History at Sussex University before gaining a Masters in Newspaper Journalism from City University. He's particularly interested in the public sector and the ethical implications of emerging technologies.

German journalist Bastian Obermayer was at his parents' home one evening late in 2014, when his laptop pinged with a message alert.

"Hello. This is John Doe," an anonymous source had written. "Interested in data? I'm happy to share". He had two conditions the Süddeutsche Zeitung reporter first had to agree to: They could only communicate over encrypted channels, and they could never meet in person.

"My life is in danger," the whistle-blower explained.

Obermayer agreed, and soon began to receive internal documents from Panamanian law firm Mossack Fonseca, the world’s fourth biggest provider of offshore services.

Over the months that followed, John Doe released a trove of documents from the company's private archive that amounted to the biggest leak in history. Buried within them were the secrets of how global celebrities and prominent politicians hid billions of dollars in tax havens, but the 2.6 terabytes of data was too large and the global implications too wide for the newspaper to process independently.

"Holy shit!" she remembers on a sunny day in her Madrid home. "We thought it was going to be one terabyte, which was four times higher than what we had dealt with before."

The ICIJ is made up of only 13 people, divided evenly between the technology team and journalists. They had already worked extensively on offshore stories since the organisation was founded in 1997, fostering cross-border collaboration among journalists to take on topics of global scope.

The network's operational model is to invite multiple international media organisations to work collectively on a project for a set period of time, before publishing together the stories that it yields.

"After a couple of weeks of work we saw that this was different than anything we had worked on before because it was forty years of data of the history of a company," says Cabra.

Mossack Fonseca operated in 21 tax jurisdictions across more than 40 offices around the world. Their clients included heads of state, celebrities, criminals, and billionaires. But before they could be exposed, the data had to be analysed.

The tech behind the stories

The Panama Papers have gained fame not only for the scoops they revealed but also for the ground-breaking technology used to unearth them. The ICIJ had already adapted tools from open source software during previous projects that provided a training ground for this one.

"We were dealing with a lot of information that was unstructured," says Cabra. "We were dealing with a lot of PDFs and images — so non-machine readable materials — and that meant that we had to invest a lot in technology."

Cabra relied on one full-time developer to create a process for what she refers to as "an army of servers" to conduct computer recognition imaging and Optical Character Recognition (OCR) of the documents, extract the text and make it searchable by robots.

"We created this processing chain like we were in a factory, but it was in Amazon in the cloud, where we would have a queue of all the documents, 11.5 million files," says Cabra. "And then the queue would send the documents to 30 different machines, 30 different servers that we had in the cloud, and the first thing that the server would do is [ask] do I know how to extract text from this?"

It would first send the document to a modified version of Apache Tika to extract the text, from there to Tesseract for the OCR and then on to Apache Solr to index the data. Solr lacked a user interface, so a piece of software normally used by librarians called Project Blacklight was added to provide an accessible search system for journalists.
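
To make the chain concrete, here is a rough Python sketch of the "extract, fall back to OCR, then index" idea. It does not call Tika, Tesseract or Solr: the two extractor functions and the TinyIndex class are hypothetical stand-ins written for this example, and the sample documents are invented.

```python
from collections import defaultdict


def extract_embedded_text(doc):
    """Stand-in for a text extractor such as Apache Tika: returns text if the file has any."""
    return doc.get("text", "")


def run_ocr(doc):
    """Stand-in for an OCR engine such as Tesseract: 'reads' the text out of an image."""
    return doc.get("image_text", "")


class TinyIndex:
    """Toy inverted index standing in for a search server such as Apache Solr."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Map every lower-cased word to the set of documents containing it.
        for word in text.lower().split():
            self._postings[word.strip(".,;:!?\"'()")].add(doc_id)

    def search(self, word):
        return sorted(self._postings.get(word.lower(), set()))


def process(doc, index):
    """Try direct extraction first; fall back to OCR only when no text comes back."""
    text = extract_embedded_text(doc)
    route = "extracted"
    if not text.strip():
        text = run_ocr(doc)
        route = "ocr"
    index.add(doc["id"], text)
    return route


if __name__ == "__main__":
    index = TinyIndex()
    sample = [
        {"id": "a.docx", "text": "Certificate of incorporation for Example Holdings"},
        {"id": "b.tiff", "image_text": "Scanned passport for the director of Example Holdings"},
    ]
    for d in sample:
        print(d["id"], "->", process(d, index))
    print(index.search("holdings"))
```

Once both kinds of file end up in the same index, a reporter searching for a name does not need to know whether the hit came from a Word document or a scanned image.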

ICIJ Data Unit head Mar Cabra has worked for the organisation since 2011. Image: Antonio Delgado

The team also developed their own private search engine to help reporters find information by modifying an open source Ruby on Rails Engine called Blacklight. The same software is used to locate books in the library of Columbia University.

Roughly a quarter of the data was information from Mossack Fonseca's internal database of forty years' worth of clients and 214,000 companies with connections to more than 200 countries. The ICIJ quickly realised that the network was too vast for a simple search engine to service.

They decided to use a second platform to explore the company's structured internal database, using graphs from the French visualisation software Linkurious, with Neo4j as the database running behind it. The tool visualised the links between people and companies in the leaks as "networks of interconnected dots."

"We’re visual animals, human beings, so once you have it visually, all of a sudden it gives you this new level of understanding, and reporters really liked it," says Cabra.

"The good thing about this is that everybody knows how to double click on a dot. So you double click on a dot and then you click in, and then all of a sudden you are finding connections that you had not found before and of course then the data-savvy people can write queries and find better stories."

The ICIJ then released the structured data to the public in the Offshore Leaks Database, "so that the whole world could become investigators with us," says Cabra. It’s had more than eight million visitors and more than 50 million page views since, from people on the street to tax authorities and law enforcement officials.

European Union law enforcement agency Europol was among those who downloaded the data. It found 3,469 probable matches between the Panama Papers database and its own files about organised crime, tax fraud and other criminality, 116 of which were related to Islamic terrorism.
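
The article does not say how Europol produced its "probable matches". Purely as an illustration of what such a comparison can involve, the Python sketch below scores name similarity between two lists with the standard library's difflib. The names, the normalise helper and the 0.85 threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case a name and collapse punctuation and extra spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def probable_matches(leak_names, own_names, threshold=0.85):
    """Return (leak_name, own_name, score) for pairs whose similarity meets the threshold."""
    matches = []
    for leak in leak_names:
        for own in own_names:
            score = SequenceMatcher(None, normalise(leak), normalise(own)).ratio()
            if score >= threshold:
                matches.append((leak, own, round(score, 2)))
    return matches


if __name__ == "__main__":
    leak = ["ACME Trading Ltd.", "Blue Harbour Holdings S.A."]  # invented names
    own = ["Acme Trading Ltd", "Green Valley Imports"]          # invented names
    for row in probable_matches(leak, own):
        print(row)
```

A score above the threshold is only a lead, not a finding: two unrelated entities can share a name, which is why such results are described as probable rather than confirmed.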

Less serious scandals were also discovered. The Spectator used the public database to find the name of Harry Potter actress Emma Watson among the interconnected dots, listed as a beneficiary of an offshore company based in the British Virgin Islands. Watson claimed the company was set up solely to preserve her privacy.

"That's the problem with the offshore world," says Cabra. "It's like this big black hole where anything can exist, the good and the bad."

Interconnected dots for a global community

"The key of what we do is called 'radical sharing'," says Cabra. "We use this methodology where we share everything we obtain with all the reporters in the team. That means that we ended up sharing 11.5 million files — 2.6 terabytes — with around 400 reporters in about 80 countries."

The ICIJ relied on effective collaborative technology to foster these large cross-border projects, including a social network based on the Oxwall open source community software.

"When we got the software, [on] the page where you enter the user details and you create your user profile, one of the default questions is are you looking for a male or female?" Mar remembers.

"It's a social network that could be used for dating, but actually isn't an investigative collaboration like dating? In the sense that you're on the same platform and you're not sharing personal information about yourself but you're sharing what you found."

The Global I-Hub collaboration tool they developed functioned like any other social network, but instead of friends sharing photos it was for reporters sharing leads. They could use it to save time and also share their struggles and offer each other suggestions for solutions.

"You would wake up in the morning and log into I-Hub and all of a sudden find the story that would be the front page of your paper found by somebody else while you were asleep," says Mar.

"All of a sudden you have an extended newsroom that is there to help you or to move your investigation forwards.

The Global I-Hub repurposed the Oxwall social network it was based on and added security to it. Image: ICIJ

"We worked with more than 100 media organisations in the Panama papers and they've just transformed in that they now look at the way they approach products differently.

"I think that for a long time reporters and media organisations have been working as lone wolves and not acknowledging that we live in a global world that is interconnected, so I think that peer-to-peer work collaborations among journalists to better work and provide a more accurate perspective of what's happening are key in this era of fake news."

Cabra describes the ICIJ’s role as that of "the UN of journalism," bringing the national and corporate cultures from more than 100 different companies together.

The flat hierarchy supported a collaboration geared towards mutual goals, but certain situations, such as when to publish, demanded that someone make a final decision that all the others would agree to.

It wasn’t always easy. The Guardian explained that legal obligations meant they must approach people for comments at a very early stage, but in countries such as Spain "you're lucky if your journalist calls you the day before," says Mar.

The ICIJ’s neutral position as "the Switzerland in the team", as she calls it, made it the best-placed body to determine which solution best catered to the needs of all the separate parties.

Within eight months of releasing the data, the team had published 4,700 stories. That figure doesn’t even count all the stories written by media organisations from outside the project.

They found high profile politicians and A-list celebrities. The Prime Minister of Iceland was forced to resign as a direct result of their investigations and the Mossack Fonseca founders are currently in custody in Panama on allegations of money laundering.

"I've been working on offshore stories for the past five years now so I've got to see a lot of things right and I have to say not much surprises me anymore," says Cabra.

"I got the first shock when I first started working. It's like, 'oh my god, even the sports shop in the north of Spain has a BVI [British Virgin Islands] company!' For me the first shock when I started investigating the offshore world is everybody's here, everybody's using this parallel economy.

"What fascinates me, and I think we did a great job [of this] in the Panama Papers, are the enablers. Who are the enablers of this system?

"We tend to focus our attention on 'oh this rich person or this politician was using offshore’', but who were the enablers that made this possible? And we did a very interesting story about the role of banks in enabling this system, and that I think that was one of my favourite stories."

Following the story as it develops

The ICIJ Data Unit didn't even exist until three years ago. They might lack the resources of corporate bodies, but they also have far fewer layers of bureaucracy to navigate in the decision-making process.

"We managed to do all this with just three developers," says Cabra. "When I talk to the corporate world about this, when I've been talking at a couple of tech conferences, they look at me like 'wow!'

"We're a very small organisation that operates as a startup, that means that we can make changes fast, we can take decisions fast, and create teams,"

Cabra nonetheless hopes to expand the team's technology capabilities in the coming years. The Panama Papers, she says, would have benefited from tools that automatically extract all the entities from the documents and that enable collections of documents to talk to each other. One example is a trove of official cassettes uncovered by colleagues in Argentina that the team couldn't use.

But the impact of the work remains indelible and has helped make data journalism mainstream. Cabra had been reminded of that while watching an episode of American legal drama Billions the day before we spoke.

"The public attorneys were saying oh how we're stuck with these offshore connections of these people, I think I'm going to look into the Panama papers," she laughs. "It was so funny to see it in Billions, in a TV series, because that's what happening to us every day."

The resource will remain for many years to come as connected stories emerge. Cabra is considering developing a Panama Papers notification system that would function in a similar way to Google Alerts, sending messages automatically when news emerges connected to the database.
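
No such notification system is described in detail, so the following is only a guess at its simplest possible form: a watchlist of entity names checked against incoming headlines. The names, headlines and function names are all hypothetical, and a real system would need far more robust matching.

```python
import re


def build_watchlist(entity_names):
    """Compile one case-insensitive, whole-phrase pattern per entity name."""
    return {
        name: re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        for name in entity_names
    }


def alerts_for(headline, watchlist):
    """Return the watched names that appear in a headline."""
    return [name for name, pattern in watchlist.items() if pattern.search(headline)]


if __name__ == "__main__":
    watchlist = build_watchlist(["Example Holdings", "Acme Trading"])  # invented names
    headlines = [
        "Regulator opens inquiry into Example Holdings directors",
        "Local weather update",
    ]
    for headline in headlines:
        hits = alerts_for(headline, watchlist)
        if hits:
            print("ALERT:", hits, "in:", headline)
```

The whole-phrase patterns keep a watched name from firing on a longer word that merely contains it, which matters when the list holds hundreds of thousands of company names.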

"Imagine what we can get to do if we use much more complex techniques?" she asks. "The reality is that we barely scratched the surface of the Panama Papers."