Big data management: 5 things you need to know

by David Loshin, President, Knowledge Integrity Inc.

As more organizations adopt big data platforms, concern mounts that application development may suffer from the lack of good practices for managing the data powering those applications. When we talk about big data management in relation to big data platforms (like those combining commodity hardware with Hadoop), it’s clear that big data technologies have created a need for new and different data management tools and processes. Here are five things you need to know about big data management that will help ensure consistency and trust in your analytic results.

1. Business users can do some big data management by themselves

One of the mantras of big data is availability – enabling access to numerous massive data sets in their original formats. Today’s business users, who are more technically adept than their predecessors, often want to access and prepare the data in its raw format rather than having it fed to them through a chain of operational data stores, data warehouses and data marts. Business users want to scan the data sources themselves and craft their reports and analyses around their own business needs.

Supporting business user self-service for big data has two big data management implications:

To permit data discovery, users will have to be allowed to peruse the data independently.

Users will need data preparation tools to assemble the information from the numerous data sets and present it for analysis.
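The second implication can be made concrete with a minimal sketch of the kind of lightweight preparation a self-service user might perform – joining two raw extracts on a shared key and keeping only the fields needed for analysis. The data sets and field names here are hypothetical, standing in for whatever raw extracts a user pulls:

```python
# Two hypothetical raw extracts, as a business user might pull them:
orders = [
    {"order_id": 1, "cust_id": "C1", "amount": 120.0},
    {"order_id": 2, "cust_id": "C2", "amount": 80.0},
]
customers = [
    {"cust_id": "C1", "region": "East"},
    {"cust_id": "C2", "region": "West"},
]

def prepare(orders, customers):
    """Join the raw extracts on cust_id and project only the analysis fields."""
    region_by_cust = {c["cust_id"]: c["region"] for c in customers}
    return [
        {"order_id": o["order_id"],
         "region": region_by_cust.get(o["cust_id"], "unknown"),
         "amount": o["amount"]}
        for o in orders
    ]

report = prepare(orders, customers)
```

Commercial data preparation tools perform essentially this join-and-project step at scale, with a visual interface in place of code.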

2. It’s not your parent’s (or grandparent’s) data model

Our conventional approach to capturing and storing data for reporting and analysis centers on absorbing data into a predefined structure. But in the big data management world, the expectation is that both structured and unstructured data sets can be ingested and stored in their original (or raw) formats, eschewing the use of predefined data models. The benefit is that different users can adapt the data sets in the ways that best suit their needs.

To reduce the risk of inconsistency and conflicting interpretations, though, this suggests the need for good practices in metadata management for big data sets. That means solid procedures for documenting the business glossary, mapping business terms to data elements, and maintaining a collaborative environment to share interpretations and methods of manipulating data for analytical purposes.
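One way to picture the glossary-to-data-element mapping is as a shared, queryable structure. This is a minimal sketch under assumed terms, definitions and element paths (all hypothetical), not any particular metadata product:

```python
# Hypothetical business glossary: each term carries a shared definition and
# the raw data elements it maps to, so analysts interpret fields consistently.
glossary = {
    "customer": {
        "definition": "A party that has placed at least one order.",
        "data_elements": ["crm.accounts.acct_id", "orders.order.cust_id"],
    },
    "net revenue": {
        "definition": "Order amount minus refunds, excluding tax.",
        "data_elements": ["orders.order.amount", "finance.refunds.amount"],
    },
}

def elements_for(term):
    """Look up the raw data elements mapped to a business term."""
    entry = glossary.get(term.lower())
    return entry["data_elements"] if entry else []
```

In a collaborative environment, users would also attach notes and manipulation methods to each entry; the essential point is that the term-to-element mapping is documented once and shared, not reinvented per analysis.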


3. Quality is in the eye of the beholder

In conventional systems, data standardization and cleansing are applied before the data is stored in its predefined model. One consequence of big data is that providing the data in its original format means no cleansing or standardization is applied when the data sets are captured.

While this provides greater freedom in how the data is used, it makes users responsible for applying any necessary transformations themselves. As long as those transformations don’t conflict with one another, the same data sets can safely serve different purposes. Big data management must therefore incorporate ways to capture user transformations, manage them centrally and ensure that they remain consistent and support coherent interpretations of the data.
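A simple way to make transformation capture concrete is a shared registry that records who transformed which field and how, and that can flag pairs of transformations that disagree. This is a toy sketch with hypothetical users, fields and rules, not a real governance tool:

```python
# Hypothetical transformation registry: each entry records which field a
# user transformed and how, so conflicting interpretations can be detected.
registry = {}

def register(user, field, name, fn):
    """Record a named transformation a user applies to a field."""
    registry.setdefault(field, []).append({"user": user, "name": name, "fn": fn})

def conflicts(field, sample):
    """Report pairs of registered transformations that disagree on a sample value."""
    entries = registry.get(field, [])
    found = []
    for i, a in enumerate(entries):
        for b in entries[i + 1:]:
            if a["fn"](sample) != b["fn"](sample):
                found.append((a["name"], b["name"]))
    return found

# Two users clean the same raw field in different ways:
register("alice", "phone", "digits_only",
         lambda v: "".join(ch for ch in v if ch.isdigit()))
register("bob", "phone", "strip_spaces",
         lambda v: v.replace(" ", ""))
```

Running `conflicts("phone", "(555) 123-4567")` exposes that the two rules produce different values for the same input – exactly the kind of inconsistency that undermines trust in analytic results if it goes undetected.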

4. Understanding the architecture improves performance

Big data platforms rely on commodity processing and storage nodes for parallel computation over distributed storage. Yet if you remain unfamiliar with the details of a SQL-on-Hadoop engine’s query optimization and execution model, you may be unpleasantly surprised by poor response times.

For example, complex JOINs may require that chunks of distributed data sets be broadcast to all computing nodes – pushing huge amounts of data across the network and creating a significant performance bottleneck. The upshot is that understanding how the big data architecture organizes data and how the database execution model optimizes queries will help you write data applications with reasonably high performance.
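The broadcast trade-off can be illustrated with a sketch of a map-side join: when one table is small, replicating it to every node lets each worker join its local partition of the large table without shuffling that table across the network. The tables and fields here are hypothetical, and a real engine (Hive, Impala, Spark SQL) makes this choice inside its optimizer:

```python
# Sketch of a map-side ("broadcast") join: the small dimension table is
# replicated to every worker, so the large fact table never leaves its node.
small_dim = {"C1": "East", "C2": "West"}   # small table, broadcast to all nodes

def map_side_join(fact_partition, broadcast_dim):
    """Each worker joins its local fact partition against the broadcast copy."""
    return [(row["cust_id"], broadcast_dim.get(row["cust_id"]), row["amount"])
            for row in fact_partition]

# One worker's local partition of the large fact table:
partition = [{"cust_id": "C1", "amount": 120.0},
             {"cust_id": "C2", "amount": 80.0}]
joined = map_side_join(partition, small_dim)
```

Broadcasting is cheap when the replicated table is small; the pathological case the text describes arises when neither side is small, so large chunks of data must be moved to every node.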

5. It’s a streaming world

In the past, much of the data collected and consumed for analytical purposes originated within the organization and was stored in static data repositories. Today, there is an explosion of streaming data. We have human-generated content such as data streamed from social media channels, blogs and emails. We have machine-generated data from myriad sensors, devices, meters and other internet-connected machines. We have automatically generated streaming content such as web event logs. All of these sources stream massive amounts of data and are prime fodder for analysis.

That volume is the crux of the issue: any big data management strategy must include technology to support stream processing that scans, filters and selects the meaningful information for capture, storage and subsequent access.
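The scan-filter-select step can be sketched in a few lines. This is a minimal illustration under assumed event fields and a hypothetical severity cutoff, not any particular stream processing engine:

```python
# Minimal sketch of the scan/filter/select step in stream processing:
# a generator keeps only meaningful events and projects the fields to store.
def select_meaningful(events, min_severity=3):
    """Yield a reduced record for each event at or above the severity cutoff."""
    for event in events:
        if event.get("severity", 0) >= min_severity:
            yield {"ts": event["ts"],
                   "source": event["source"],
                   "severity": event["severity"]}

# A few hypothetical sensor events standing in for an unbounded stream:
stream = [
    {"ts": 1, "source": "sensor-a", "severity": 1, "payload": "..."},
    {"ts": 2, "source": "sensor-b", "severity": 4, "payload": "..."},
    {"ts": 3, "source": "sensor-a", "severity": 5, "payload": "..."},
]
kept = list(select_meaningful(stream))
```

Because the generator processes one event at a time, the same pattern applies to an unbounded stream: only the selected, reduced records ever reach storage.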

Considerations for big data management

Managing big data not only subsumes many of the conventional approaches to data modeling and architecture; it also entails a new cadre of technologies and processes to enable broader data accessibility and usability. A big data management strategy must embrace tools for data discovery, data preparation, self-service data access, collaborative semantic metadata management, data standardization and cleansing, and stream processing. Being aware of these implications can dramatically speed the time-to-value of your big data program.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David writes prolifically on data management best practices via the expert channel at b-eye-network.com and through numerous books, white papers and web seminars.