Keynote | Business
Thursday 16th | 12:15 - 12:45 | Theatre 18
Involving 2.6 TB of data and 11.5 million documents, the Panama Papers was the largest leak of files and the largest cross-border investigation in journalism history. For one year, more than 400 reporters across 80 countries dived into this massive trove of information, which exposed how the offshore economy works.
Keywords defining the session:
- Big data
- Data leak
A member of the International Consortium of Investigative Journalists (ICIJ) will explain how the team decoded the Panama Papers and revealed history’s biggest data leak. In May 2015, ICIJ obtained from the German newspaper Süddeutsche Zeitung an encrypted hard drive with leaked data from the Panamanian law firm Mossack Fonseca. The documents received would eventually total 2.6 TB and 11.5 million files. The ICIJ Data and Research Unit, with staff in four countries on two continents, started looking at how to process and analyze the data. The main challenges were dealing with dozens of data formats, putting them into a consistent, visual database, and then making all this data available to journalists worldwide.
It immediately became clear that the leaked records included files from Mossack Fonseca’s client database, with information on who was secretly using the offshore world. The final challenge was to reverse engineer and reconstruct that database so that journalists, and ultimately the public, could use it. ICIJ used Talend Big Data to integrate all the information with Mossack Fonseca’s internal database and to document each step, so it could later be reproduced or checked by any other member of the remote team. Such work would have been painful and difficult without such a tool. Using Talend simplified the complex database reconstruction and the project’s quality assessment, allowing ICIJ to quickly start the investigation with 400 journalists in 80 countries.
By September 2015, ICIJ had transformed over 3 million database files into a SQL database, which was then loaded into Neo4j using Talend. The results were visualized as a graph using Linkurious, so journalists could easily find people and stories that they would otherwise have missed.
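To make the SQL-to-graph step more concrete, here is a minimal sketch in Python of how relational rows can be reshaped into nodes and relationships. It is an illustration only, not ICIJ's pipeline, which was built with Talend and Neo4j. The table names, column names, relationship labels and sample rows are all hypothetical, and an in-memory SQLite database stands in for the real SQL database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical three-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE officers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT, jurisdiction TEXT);
        CREATE TABLE officer_entity (officer_id INTEGER, entity_id INTEGER, role TEXT);
        INSERT INTO officers VALUES (1, 'Example Person A'), (2, 'Example Person B');
        INSERT INTO entities VALUES (10, 'Example Holdings Ltd', 'XX');
        INSERT INTO officer_entity VALUES
            (1, 10, 'SHAREHOLDER_OF'),
            (2, 10, 'DIRECTOR_OF');
        """
    )
    return conn


def relational_to_graph(conn):
    """Turn rows into graph nodes and the link table into directed edges."""
    nodes = {}
    edges = []
    # Each row of an entity-like table becomes one node with properties.
    for oid, name in conn.execute("SELECT id, name FROM officers ORDER BY id"):
        nodes[f"officer:{oid}"] = {"label": "Officer", "name": name}
    for eid, name, jurisdiction in conn.execute(
        "SELECT id, name, jurisdiction FROM entities ORDER BY id"
    ):
        nodes[f"entity:{eid}"] = {
            "label": "Entity",
            "name": name,
            "jurisdiction": jurisdiction,
        }
    # Each row of the link table becomes one (source, type, target) edge.
    for oid, eid, role in conn.execute(
        "SELECT officer_id, entity_id, role FROM officer_entity "
        "ORDER BY officer_id, entity_id"
    ):
        edges.append((f"officer:{oid}", role, f"entity:{eid}"))
    return nodes, edges


def linked_to(edges, node_id):
    """Return the sorted ids of all nodes with an edge pointing at node_id."""
    return sorted(source for source, _, target in edges if target == node_id)


if __name__ == "__main__":
    nodes, edges = relational_to_graph(build_demo_db())
    for source in linked_to(edges, "entity:10"):
        print(nodes[source]["name"], "->", nodes["entity:10"]["name"])
```

Once the data is in this shape, a question such as "who is connected to this company?" becomes a walk along edges instead of a chain of table joins, which is the kind of exploration a graph database and a visualization layer are designed to support.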
ICIJ knew from the beginning that it ultimately wanted to make the database open to the public, so the team started working on a public-facing solution. This raised the data quality requirements, since millions of people would see the information and a mistake could be catastrophic for ICIJ in terms of reputation and lawsuits. Talend was key in letting ICIJ’s data team work efficiently across two continents with each step of the preparation process documented. On April 3, 2016, more than 100 media organizations published the results of the year-long investigation. The list of over 210,000 companies across 21 jurisdictions included activities linked to the ongoing Syrian war, the looting of resources in Africa, and individual offshore transactions from billionaires,