
Big data or big disaster?

When I first started posting about big data, very few users existed in South Africa.

Today, most large organisations have a Hadoop data lake – in many cases replacing traditional ETL and/or acting as a data archive, as well as a feeder to the enterprise data warehouse and various operational data marts.

In a few short years, Hadoop has moved from new experiment to core component of the data architecture.

The challenges of finding, understanding and delivering trusted data through the data lake are more relevant than ever.

In many cases, the intention of the data lake is to support a more agile approach to data sourcing – empowering IT staff and data scientists to deliver new insights more quickly, and enabling self-service business intelligence for end users.

This democratization of data – making data available to the people who need it – is a good thing.

But early adopters are finding that an ungoverned and undocumented data lake cannot deliver.

In the chaos of the data swamp, legitimate users cannot identify the data they need, or cannot trust it because they cannot verify its source, measure its timeliness, or track its quality.

Conversely, without proper governance, sensitive data can be exposed to unauthorized users. In the wake of the Facebook / Cambridge Analytica scandal, and with GDPR and PoPIA looming, business is beginning to understand the importance of protecting sensitive data from illegitimate or unethical uses.

In order to be trusted, and useful, the data lake must:

Make it easy to find the data I need

Make it easy to understand the context of the data – i.e. track its origins and technical details (lineage)

Make it possible to assess the quality of the data

Allow me to find similar (or related) data sets to broaden my analytics potential, and to compare various data sets against each other to find the best possible set for my purpose

Govern access to data (e.g. through data sharing agreements) to ensure that sensitive data sets are protected and to measure which data is useful

Recognise that the data lake is part of a broader data ecosystem and provide insights into how and where it fits in.
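To make the requirements above concrete, here is a minimal sketch of what a single catalog entry might capture – the names, fields, and scoring are illustrative assumptions, not any particular product's schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data set in a governed catalog."""
    name: str                        # searchable name – makes the data easy to find
    description: str                 # business context for understanding
    source_system: str               # origin of the data (lineage)
    upstream_datasets: list = field(default_factory=list)  # lineage chain
    quality_score: float = 0.0       # e.g. fraction of records passing profiling rules
    last_refreshed: str = ""         # timeliness indicator
    sensitivity: str = "public"      # drives access governance
    approved_users: list = field(default_factory=list)     # data sharing agreement

    def can_access(self, user: str) -> bool:
        # Sensitive data is only visible to users covered by an agreement
        return self.sensitivity == "public" or user in self.approved_users

entry = CatalogEntry(
    name="customer_master",
    description="Golden record of active customers",
    source_system="CRM",
    upstream_datasets=["crm_raw.customers"],
    quality_score=0.97,
    last_refreshed="2018-05-01",
    sensitivity="restricted",
    approved_users=["analyst01"],
)

print(entry.can_access("analyst01"))  # True – covered by a sharing agreement
print(entry.can_access("guest"))      # False – sensitive data stays protected
```

Even this toy record shows how lineage, quality, timeliness, and access control can live alongside the data itself rather than in scattered spreadsheets.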

One approach to solving these challenges, that fits with the concept of data democratization, is the governed data catalog.