What is the Secret Behind the Panama Papers?

Hardik Gohil / 6 min read.
May 18, 2016

The latest buzz about the Panama Papers has shaken the world. As we all know, the Panama Papers are a 2.6 TB set of data comprising 11.5 million confidential documents with detailed information about more than 214,000 offshore companies listed by the Panamanian corporate service provider Mossack Fonseca.

The Panama Papers have set an excellent example for the world of the importance of data science when it comes to analyzing big data. This leak makes us realize that appropriate approaches are needed to handle the challenges of data management, both now and in the future.

Let's take a deep dive into the Panama Papers and dig into the secret behind the biggest leak ever.

This leak contains 4.8 million emails, 3 million database entries, 2.15 million PDFs, around one million images and 320,000 text documents. It has been described as the world's biggest cache of data ever handed over to journalists.

[Figure: Breakdown of the leaked Panama Papers documents by file type]

How Süddeutsche Zeitung and the ICIJ Analyzed This Data

Overall, this leaked database was analyzed using the latest techniques for advanced document and data analysis. This underscores the importance of technology's role in helping the International Consortium of Investigative Journalists (ICIJ) and Süddeutsche Zeitung create the biggest news story of the year so far.

The main challenge in this scenario was the variety of information provided to Süddeutsche Zeitung and the ICIJ. Most of the data they received was unstructured, as shown in the diagram above. The most challenging aspect of analyzing data from heterogeneous sources is ingesting everything into a form that is consumable, queryable, and searchable. The indexing phase required for fast retrieval from massive unstructured data sources demands robust parallel processing, as described below.

Unstructured Data

Let's understand this process in detail. In its simplest form, unstructured data comes as text files and HTML files. These are relatively easy for engines to handle because they are basically plain text.

Common document formats are only slightly more complex. Word processor documents, presentations, and so on mainly contain text but they may also have metadata and embedded content. They are popular and are not much more difficult to index compared to plain text files.

Container-like structures embed many objects along with their metadata. Folders and compressed archive files are some examples of container type formats. With some engines it is difficult to extract items inside containers and index them.
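To make the container problem concrete, here is a minimal sketch (Python chosen for illustration, with hypothetical paths) that walks a folder tree and treats each file inside a ZIP archive as its own indexable item:

```python
import os
import zipfile

def iter_documents(root):
    """Yield (name, raw_bytes) pairs for plain files and for items
    embedded inside ZIP archives found under `root`."""
    for dirpath, _, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if zipfile.is_zipfile(path):
                # Container format: unpack each embedded member as its own item
                with zipfile.ZipFile(path) as archive:
                    for member in archive.namelist():
                        if not member.endswith("/"):
                            yield f"{path}!{member}", archive.read(member)
            else:
                with open(path, "rb") as handle:
                    yield path, handle.read()

# Hypothetical usage:
# for name, data in iter_documents("/data/leak"):
#     print(name, len(data), "bytes")
```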

Larger and more complex container types such as Microsoft SharePoint, CMSs, and email archives have considerably more embedded items. Often, embedded components and metadata are stored separately. This kind of architecture does not lend itself to easy indexing. Therefore, despite the availability of native search capabilities, these systems are far more difficult to index for most engines.

The highest level of complexity is introduced by compliance-based storage systems. These are secure and often add a layer of obfuscation to make it difficult to tamper with data once it is stored in these systems.

In addition to the progressive difficulty of penetrating these various formats, there is the challenge of processing at scale, graceful handling of faults and failures, and load balancing. All this necessitates a specialized system that can perform these non-trivial tasks and transform unstructured data into a fully indexed, readily searchable form.
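As a rough illustration of such a pipeline, the sketch below extracts text and metadata in parallel with Apache Tika and pushes the results into Apache Solr. It assumes the `tika` and `pysolr` Python packages and a local Solr core named `leak`; the URL, core name, and field names are placeholders, not the actual setup used by the ICIJ.

```python
from concurrent.futures import ProcessPoolExecutor

import pysolr                # pip install pysolr
from tika import parser      # pip install tika (runs a local Tika server, needs Java)

SOLR_URL = "http://localhost:8983/solr/leak"   # assumed local Solr core

def extract(path):
    """Run Apache Tika on one file and return a Solr-ready document."""
    parsed = parser.from_file(path)
    return {
        "id": path,
        "content": parsed.get("content") or "",
        "metadata": str(parsed.get("metadata") or {}),
    }

def index_corpus(paths, workers=8):
    """Extract text and metadata in parallel, then index everything in Solr."""
    solr = pysolr.Solr(SOLR_URL, always_commit=True, timeout=60)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        docs = list(pool.map(extract, paths))
    solr.add(docs)

# Hypothetical usage:
# index_corpus(["/data/leak/email-001.msg", "/data/leak/contract.pdf"])
```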

OCR and Image Processing

Many files and documents are not available in a digital, textual format. Images and PDFs are the hardest to analyze, especially when the goal is to identify what those images and PDFs contain, what language they are in, and how each block of text connects to the others.

For instance, a newspaper clipping containing vital information cannot be directly fed into a retrieval system. It must be first transformed into a digital format and then its text must somehow be extracted from it.

This is where OCR comes in. Optical Character Recognition, or OCR, is the technology that helps read text off images containing handwritten, typewritten, or printed text. It produces a text file that can be indexed by an indexing engine.
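A minimal OCR sketch, assuming the Tesseract engine is installed along with the `pytesseract`, `Pillow`, and `pdf2image` packages (the file names are placeholders):

```python
import pytesseract                       # pip install pytesseract (requires Tesseract)
from PIL import Image                    # pip install Pillow
from pdf2image import convert_from_path  # pip install pdf2image (requires Poppler)

# A scanned newspaper clipping: read the text straight off the image
clipping_text = pytesseract.image_to_string(Image.open("clipping.png"), lang="eng")

# A scanned PDF: render each page to an image, then OCR page by page
pages = convert_from_path("scanned_contract.pdf", dpi=300)
pdf_text = "\n".join(
    pytesseract.image_to_string(page, config="--psm 3")  # automatic page segmentation
    for page in pages
)

print(clipping_text[:200])
print(pdf_text[:200])
```

The resulting text can then be fed into the same indexing pipeline as any other document.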



A good OCR system allows custom specification of font name, size, spacing, and so on, and can adjust to different aspect ratios and scales. This enables highly precise scanning of text off images and PDFs and is particularly useful for processing items containing printed text. Modern OCR systems might also employ machine learning techniques to achieve a high level of precision. The ICIJ contacted big data analytics firm Nuix to help make sense of the vast amount of information it had received. Let's see Nuix's role in revealing the Panama Papers.


Text Analytics

Text analytics is a wide field with various applications.

Named entity recognition (NER) is a subtask of text analytics that involves finding elements of text that belong to predefined categories such as names of people, places, and organizations, or numerical values such as monetary amounts, quantities, and percentages.

Some modern NER systems for English achieve near-human performance. Apart from grammar-based techniques, statistical models and machine learning techniques often aid NER in achieving greater performance.
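For illustration, a few lines of spaCy (our library choice, not one named in the article) are enough to pull people, organizations, and places out of a snippet of text, assuming the small English model is installed:

```python
import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

text = (
    "Mossack Fonseca registered more than 214,000 offshore companies, "
    "and the documents were shared with journalists in Munich."
)

doc = nlp(text)
for ent in doc.ents:
    # Prints each entity with its label (e.g. ORG, GPE, CARDINAL) and offsets
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```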

Graph Database

A graph database, also called a graph-oriented database, is a database that uses graph structures for semantic queries. It is principally a collection of nodes and edges used to store, map and query relationships. Most graph databases are NoSQL in nature and store their data in a key-value store or a document-oriented database. Each node stands for an entity (for example, a person or business) and each edge signifies a connection or relationship between two nodes. Every node in a graph database is defined by a unique identifier, a set of outgoing and/or incoming edges, and a set of properties expressed as key-value pairs. Graph databases are best suited for analyzing interconnections, which is why many Fortune 500 companies use them to mine data from diverse sources.
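A toy sketch of this node/edge/property model using the official Neo4j Python driver; the connection details, labels, and relationship type below are assumptions for illustration, not the ICIJ's actual schema:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Each node is an entity with key-value properties;
    # each edge is a named relationship between two nodes.
    session.run(
        """
        MERGE (p:Officer {name: $person})
        MERGE (c:Entity {name: $company, jurisdiction: $jurisdiction})
        MERGE (p)-[:SHAREHOLDER_OF]->(c)
        """,
        person="John Doe",
        company="Example Offshore Ltd.",
        jurisdiction="Panama",
    )

driver.close()
```

Using MERGE rather than CREATE keeps the load idempotent, so re-running the import does not duplicate nodes or relationships.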

Check out this experiment: https://linkurious.icij.org/widget/5eab4965

How Gunnlaugsson hides his secret assets.

Graph databases are also beneficial for dealing with data in business disciplines that involve complex relationships, such as supply chain management, identifying the source of an IP telephony issue, and serving recommendations based on the graph.

How a Graph Database Helped the ICIJ

The ICIJ had already demonstrated its ability to coordinate worldwide investigations numerous times in the past with the help of Linkurious Enterprise; Linkurious has been a partner of the ICIJ since the Swiss Leaks scandal.
The challenging part for Linkurious Enterprise was to apply its data analysis expertise to make the documents exploitable by the reporters. They extracted the metadata of the documents using Apache Solr and Tika, then connected all the information from the leaked databases together and created a graph of nodes and edges. The data was stored in the Neo4j graph database provided by their partner Neo Technology. After in-depth analysis, they gained unique insights into the offshore banking sector, revealing the relationships between banks, offshore companies, clients and their lawyers. The graph includes many public personalities worldwide who have used Mossack Fonseca's services. Let's look at the example given below:

Check out this experiment: https://linkurious.icij.org/widget/7dcc235e

The network of middlemen and companies hiding Putin's wealth

The practices used by the journalists were quite simple: they collected lists of high-profile individuals and their first circle, tried to identify links between these individuals and offshore entities, connected the dots, and quickly fact-checked the results with the help of graph database technology.
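That "connect the dots" step essentially comes down to path queries. Here is a hedged sketch that reuses the hypothetical schema from the earlier snippet to find chains of intermediaries between a person of interest and any offshore entity:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def links_to_offshore(person_name, max_hops=4):
    """Return the shortest chains of nodes connecting a person of interest
    to any offshore entity, up to `max_hops` relationships away."""
    query = (
        "MATCH (p:Officer {name: $name}), (e:Entity) "
        f"MATCH path = shortestPath((p)-[*..{max_hops}]-(e)) "
        "RETURN [n IN nodes(path) | coalesce(n.name, '?')] AS chain"
    )
    with driver.session() as session:
        return [record["chain"] for record in session.run(query, name=person_name)]

# Hypothetical usage:
# for chain in links_to_offshore("John Doe"):
#     print(" -> ".join(chain))
```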

Revealing the Secret

All of us know that the secret behind the leak of the Panama Papers is definitely the information the documents contain, but the real hero in this story is data science, which enabled the ICIJ and Süddeutsche Zeitung to carry out this journalistic scoop.

This is not the end of the revolution; the bigger picture is yet to be revealed.

Tip of the day: If you have tons of data in various formats that need to be analyzed, feel free to contact our data scientists.

Special thanks to Shail Deliwala, Data Scientist at Softweb Solutions, for his technical input.


This post was originally published on https://www.softwebsolutions.com/

Categories: Big Data
Tags: analysis, Big Data, data leak, database, processing

About Hardik Gohil

Hardik is a writer and content marketer who works at Softweb Solutions Inc., based in Chicago, Illinois. A writer by day and a reader by night, he's a tech-savvy person who always craves something innovative and trendy. He counts writing, playing the keyboard, travelling and fries among his myriad interests. You can find him blabbering about data science, big data analytics, data visualization, content and social media marketing, and other freaky stuff at @iamhard1k.
