ICIJ used Talend to load more than 1.4 TB of unstructured data into a Neo4j graph database, which works with the Linkurious graph visualization platform to organize and access the information. The data includes emails and Excel, CSV, and PDF documents containing text and images about companies and people who use a hidden system built to avoid paying taxes. ICIJ also used other open source tools to support its “Knowledge Center” and make the information searchable by reporters.
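As a rough illustration of what loading tabular records into a graph database can involve, the sketch below turns CSV rows into parameterized Cypher statements. It is not ICIJ's or Talend's actual pipeline: the column names (`person`, `company`), the node labels, and the `OFFICER_OF` relationship are hypothetical and chosen only for this example.

```python
import csv
import io
from typing import Dict, Iterator, Tuple

# Hypothetical schema: Person and Company nodes joined by OFFICER_OF.
# MERGE creates a node or relationship only if it does not already exist,
# so re-running the load does not produce duplicates.
MERGE_OFFICER = (
    "MERGE (p:Person {name: $person}) "
    "MERGE (c:Company {name: $company}) "
    "MERGE (p)-[:OFFICER_OF]->(c)"
)


def rows_to_statements(csv_text: str) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (cypher, parameters) pairs, one per usable CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Basic cleaning: trim whitespace around each value.
        person = row["person"].strip()
        company = row["company"].strip()
        # Skip rows that cannot form a complete relationship.
        if not person or not company:
            continue
        yield MERGE_OFFICER, {"person": person, "company": company}
```

Each yielded pair could then be passed to a Neo4j client, for example as `session.run(query, params)` with the official Python driver. Passing values as parameters rather than formatting them into the query string keeps names containing quotes from breaking the statement.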

“Talend is our preferred solution when it comes to cleaning, transforming, and integrating the data we receive. It works as a crucial mechanism for enabling us to build a robust database,” said Pierre Romera, CTO at ICIJ. “Working with open source tools like Talend ensures security and reliability of data as our extensive network of investigative journalists review terabytes of files. Backed by an extensive community of contributors, open source solutions enable us to benefit from the latest innovations in data processing, extraction, and visualization.”

The cloud is also a central element of ICIJ’s data journey. The organization uses Amazon Web Services (AWS) to process all the data and broaden access to it. ICIJ set up temporary machines in AWS to parallelize data extraction. The organization uses Ubuntu, Tesseract, and an in-house tool called Extract to perform optical character recognition (OCR) and extract text from files.
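The parallel extraction described above can be sketched on a single machine as follows. This is a minimal illustration, not ICIJ's Extract tool: it assumes the `tesseract` command-line program is installed and on the PATH, and the function names `ocr_with_tesseract` and `extract_all` are made up for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def ocr_with_tesseract(image_path: Path) -> str:
    """Run the Tesseract CLI on one image and return the recognized text."""
    # "tesseract <image> stdout" writes the recognized text to standard output.
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


def extract_all(
    paths: Iterable[Path],
    extract: Callable[[Path], str] = ocr_with_tesseract,
    workers: int = 4,
) -> Dict[Path, str]:
    """Apply `extract` to every path concurrently and map each path to its text."""
    paths = list(paths)
    # Threads are sufficient here because the heavy work happens in the
    # external Tesseract processes, not in the Python interpreter.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(extract, paths))
    return dict(zip(paths, texts))
```

The `extract` parameter is injectable so the same fan-out logic can be reused with a different extractor, for example one that reads text directly from PDFs. Spreading the work across several temporary machines, as ICIJ did, would additionally require splitting the file list between hosts, which this sketch does not cover.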

“Moving to the cloud was obvious due to the nature of our mission and the large volume of data we process. Cloud technology offers the scalability we need when we need it, so we can easily manage our workload. With a robust power for processing and security, AWS was the most suitable choice for us,” explained Romera.

The 13.4 million tell-tale documents were obtained by the German newspaper Süddeutsche Zeitung, which received data from two offshore services firms operating in countries ranging from Bermuda to Singapore, as well as from 19 corporate registries around the world. For about a year, ICIJ worked with hundreds of journalists and media partners to expose this new lead, which has had a significant impact on well-known individuals and large organizations.

“Since ICIJ revealed the Panama Papers leak in 2016 for which they won the Pulitzer Prize, we have seen how much data management and processing technologies can impact our society,” said Ciaran Dynes, SVP of Products, Talend. “We are pleased to support in-depth investigative journalism and those seeking meaningful insights from data.”

For more information about the data behind the Paradise Papers, you can watch this video. Visit www.talend.com for additional information about Talend’s integration solutions.
