Data Governance in a Big Data World

Robust governance programs will always be rooted in people and process, but you also need to choose the right technology, especially when working with big data.

By Mitesh Shah

September 15, 2017

Organizations across the globe are investing in systems capable of housing and processing data in ways previously unimagined. In some cases, enterprises are even replatforming their existing IT environments based on these new systems. These big data systems have yielded tangible results: increased revenues and lower costs. Yet positive outcomes are far from guaranteed. To truly get value from their data, organizations must govern these new platforms.

The term data governance strikes fear in the hearts of many data practitioners. Because it is often vaguely defined and misunderstood, many simply turn to a technology-only approach to meet their governance needs. The complexity that comes with many big data systems makes this technology-based approach especially appealing, even though it's well known that technology alone will rarely suffice. What is perhaps less known is that the technologies themselves must be revisited when optimizing for data governance today.

Before we define what data governance is, perhaps it would be helpful to understand what data governance is not.

Data governance is not data lineage, stewardship, or master data management. Each of these terms is often heard in conjunction with -- and even in place of -- data governance. In truth, these practices are components of some organizations' data governance programs. They are important components, but they are merely components nonetheless.

At its core, data governance is about formally managing important data throughout the enterprise and thus ensuring value is derived from it. Although maturity levels will vary by organization, data governance is generally achieved through a combination of people and process, with technology used to simplify and automate aspects of the process.

Take, for example, security. Even basic levels of governance require that an enterprise's important, sensitive data assets are protected. Processes must prevent unauthorized access to sensitive data and expose all or parts of this data to users with a legitimate "need to know." People must help identify who should or should not have access to certain types of data. Technologies such as identity management systems and permission management capabilities simplify and automate key aspects of these tasks. Some data platforms simplify chores even further by tying into existing username/password-based registries, such as Active Directory, and allowing for greater expressiveness when assigning permissions, beyond the relatively few degrees of freedom afforded by POSIX mode bits.
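To make the point about expressiveness concrete, here is a minimal sketch in Python. It is illustrative only: the `User` type, the group names, and the `rule_can_read` function are hypothetical and do not represent any particular platform's permission syntax. It contrasts a POSIX-style check (one owner, one group, everyone else) with a rule that combines group membership using "and," "or," and "not."

```python
import stat
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset = field(default_factory=frozenset)


def posix_can_read(mode: int, owner: str, group: str, user: User) -> bool:
    """POSIX-style check: exactly one owner, one group, and 'everyone else'."""
    if user.name == owner:
        return bool(mode & stat.S_IRUSR)
    if group in user.groups:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


def rule_can_read(user: User) -> bool:
    """Hypothetical richer rule: finance staff who are not contractors,
    plus one named auditor account."""
    in_finance = "finance" in user.groups
    is_contractor = "contractors" in user.groups
    return (in_finance and not is_contractor) or user.name == "auditor1"


alice = User("alice", frozenset({"finance"}))
bob = User("bob", frozenset({"finance", "contractors"}))

# Mode 0o640: owner read/write, group read, others nothing.
print(posix_can_read(0o640, "carol", "finance", alice))  # True
print(posix_can_read(0o640, "carol", "finance", bob))    # True: mode bits cannot exclude contractors
print(rule_can_read(alice))  # True
print(rule_can_read(bob))    # False
```

With mode bits alone, excluding contractors would typically mean creating and maintaining another group. A rule-based check states the intent directly.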

We should also recognize that as the speed and volume of data increase, it will be nearly impossible for humans (e.g., data stewards or security analysts) to classify this data in a timely manner. Organizations are sometimes forced to keep new data locked down in a holding cell until someone has appropriately classified and exposed it to end users. Valuable time is lost. Fortunately, technology providers are developing innovative ways to automatically classify data, either directly when ingested or soon thereafter. Leveraging such technologies satisfies a key prerequisite of the authorization process while minimizing time to insight.
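As a rough illustration of classification at ingest, the Python sketch below tags each incoming record with the kinds of sensitive data it appears to contain. The patterns, tag names, and the `classify_record` and `ingest` functions are assumptions made for this example. Commercial classifiers are far more sophisticated, and two regular expressions are no substitute for a vetted rule set.

```python
import re

# Hypothetical patterns for illustration; real rule sets are larger and vetted.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def classify_record(record: dict) -> dict:
    """Map each field name to the sorted list of sensitive tags found in it."""
    tags = {}
    for name, value in record.items():
        found = sorted(tag for tag, rx in PATTERNS.items() if rx.search(str(value)))
        if found:
            tags[name] = found
    return tags


def ingest(record: dict) -> dict:
    """Attach classification as the record arrives, so authorization
    rules can be applied without waiting for manual review."""
    tags = classify_record(record)
    return {"data": record, "tags": tags, "sensitive": bool(tags)}


result = ingest({"note": "call 123-45-6789", "contact": "a@example.com", "qty": 3})
print(result["tags"])       # {'note': ['us_ssn'], 'contact': ['email']}
print(result["sensitive"])  # True
```

Because the tags travel with the record, downstream permission checks can key off them right away instead of leaving the data in a holding cell.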

How Is Data Governance Different in the Age of Big Data?

By now, most of us are familiar with the three V's of big data:

Volume: The volume of data housed in big data systems can reach into the petabytes and beyond.

Variety: Data is no longer only in simple relational format; it can be structured, semistructured, or even unstructured; data repositories span files, NoSQL tables, and streams.

Velocity: Data needs to be ingested quickly from devices around the globe, including IoT sources. Data must be analyzed in real time.

Governing these systems can be complicated. Organizations are typically forced to stitch together separate clusters, each of which has its own business purpose or stores and processes unique data types such as files, tables, or streams. Even if the stitching itself is done carefully, gaps are quickly exposed because securing data sets consistently across multiple repositories can be extremely error-prone.

Converged architectures greatly simplify governance. In converged systems, several data types (e.g., files, tables, and streams) are integrated into a single data repository that can be governed and secured all at once. There is no stitching to be done per se because the entire system is cut from and governed against the same cloth.

Beyond the three V's, there is another, more subtle difference. Most, if not all, big data distributions include an amalgamation of different analytics and machine learning engines sitting "atop" the data store(s). Spark and Hive are just two of the more popular ones in use today. This flexibility is great for end users because they can simply pick the tool best suited to their specific analytics needs. The trouble from a governance perspective is that these tools don't always honor the same security mechanisms or protocols, nor do they log actions completely, consistently, or in repositories that can scale -- at least not "out of the box."

As a result, big data practitioners might be caught flat-footed when trying to meet compliance or auditor demands about, for example, data lineage -- a component of governance that aims to answer the question "Where did this data come from and what happened to it over time?"

Streams-Based Architecture for Data Lineage

Luckily, it is possible to solve for data lineage using a more prescriptive approach and in systems that scale in proportion to the demands of big data. In particular, a streams-based architecture allows organizations to "publish" data (or information about data) that is ingested and transformed within the cluster. Consumers can then "subscribe" to this data and populate downstream systems in whatever way is deemed necessary.

It is now a simple matter to answer basic lineage questions such as, "Why do my results look wrong?" Just use the stream to rewind and replay the sequence of events to determine where things went awry. Moreover, administrators can even replay events from the stream to recreate downstream systems should they get corrupted or fail.
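The publish, subscribe, and replay pattern described above can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions, not a real streaming platform: the `LineageEvent` and `LineageStream` names, the event fields, and the data set names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class LineageEvent:
    offset: int   # position in the log
    dataset: str  # data set the event is about
    action: str   # e.g. "ingest" or "transform"
    detail: str   # free-text description of what happened


class LineageStream:
    """Append-only log: events can be published and replayed, not edited."""

    def __init__(self) -> None:
        self._events: List[LineageEvent] = []

    def publish(self, dataset: str, action: str, detail: str) -> LineageEvent:
        event = LineageEvent(len(self._events), dataset, action, detail)
        self._events.append(event)
        return event

    def replay(self, handler: Callable[[LineageEvent], None], from_offset: int = 0) -> None:
        """Feed every event from from_offset onward to a subscriber, in order."""
        for event in self._events[from_offset:]:
            handler(event)


stream = LineageStream()
stream.publish("orders_raw", "ingest", "loaded from source system")
stream.publish("orders_clean", "transform", "derived from orders_raw: dropped null ids")
stream.publish("orders_daily", "transform", "derived from orders_clean: aggregated by day")

# Subscriber 1: rewind and list what happened, in order.
history = []
stream.replay(lambda e: history.append((e.offset, e.dataset, e.action)))

# Subscriber 2: rebuild a downstream catalog from scratch by replaying the log.
catalog = {}
stream.replay(lambda e: catalog.update({e.dataset: e.detail}))

print(history[0])         # (0, 'orders_raw', 'ingest')
print(sorted(catalog))    # ['orders_clean', 'orders_daily', 'orders_raw']
```

The second subscriber shows the recovery case: if the catalog is lost or corrupted, replaying the same events reproduces it.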

This is arguably a more compliance-friendly approach to solving for data lineage, but certain conditions must be met. Specifically:

The streams must be immutable (i.e., published events cannot be dropped or changed)

Permissions must be set for publishers and subscribers of all events

Audit logs must record who consumed data and when

The streams must support global replication to provide high availability should a given site fail
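Three of these conditions (immutability, permissions, and audit logging) can be illustrated with a small Python sketch. Everything here is hypothetical, including the `GovernedStream` class and the user names. Replication across sites is a property of the underlying platform and is not modeled.

```python
from datetime import datetime, timezone


class GovernedStream:
    """Toy stream that checks permissions and audits every access attempt."""

    def __init__(self, publishers, subscribers):
        self._events = []                          # append-only; no update or delete methods
        self._publishers = frozenset(publishers)   # who may publish
        self._subscribers = frozenset(subscribers) # who may consume
        self.audit_log = []                        # who did what, when, and whether it was allowed

    def _audit(self, user: str, action: str, allowed: bool) -> None:
        self.audit_log.append({
            "user": user,
            "action": action,
            "allowed": allowed,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def publish(self, user: str, event: str) -> None:
        allowed = user in self._publishers
        self._audit(user, "publish", allowed)
        if not allowed:
            raise PermissionError(f"{user} may not publish")
        self._events.append(event)

    def consume(self, user: str) -> tuple:
        allowed = user in self._subscribers
        self._audit(user, "consume", allowed)
        if not allowed:
            raise PermissionError(f"{user} may not consume")
        return tuple(self._events)  # read-only snapshot for the caller


stream = GovernedStream(publishers={"etl_job"}, subscribers={"analyst"})
stream.publish("etl_job", "orders_raw ingested")
events = stream.consume("analyst")

try:
    stream.consume("intruder")
except PermissionError:
    pass  # the denied attempt is still recorded in the audit log

print(events)  # ('orders_raw ingested',)
print([(e["user"], e["allowed"]) for e in stream.audit_log])
# [('etl_job', True), ('analyst', True), ('intruder', False)]
```

Recording denied attempts alongside successful ones is a design choice that tends to help when answering auditor questions, because the log shows who tried to reach the data as well as who succeeded.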

Summary

Robust governance programs will always be rooted in people and process, but the right choice and use of technology are critical. The unique set of challenges posed by big data makes this statement true now more than ever. Technology can be used to simplify aspects of governance (such as security) and close gaps that would otherwise cause problems for key practices (such as data lineage).

About the Author

Mitesh Shah is a senior technologist with MapR and is responsible for security and data governance strategy. Prior to MapR, Mitesh held positions in enterprise security at organizations including the Federal Reserve and Salesforce.com. Mitesh has a degree in computer science from Cornell and an MBA from The Wharton School of the University of Pennsylvania. You can contact the author at miteshshah@mapr.com.
