Is Your Data Lying to You?

As companies strive to become more agile in today’s ever-changing business world, a common theme is getting data faster and, in turn, getting insights from that data faster. That’s where the notion of schema-free queries often comes in: all sorts of unstructured data goes into files in Hadoop (the Hadoop Distributed File System, queried with tools such as Hive or Drill) or into SQL and NoSQL databases that support late binding. Late binding, to get on the same page, is the practice of transforming and binding data based on relationships at program runtime, as opposed to early binding, where transformations are done when data moves from source systems into the database.
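To make the distinction concrete, here is a minimal Python sketch. The event records, the field names and the helpers load_early and total_late are hypothetical, invented purely for illustration; they do not come from any particular product.

```python
import json
from datetime import date

# Hypothetical raw events as they might arrive from a source system.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "day": "2015-03-01"}',
    '{"user": "u2", "amount": "5.00", "day": "2015-03-02"}',
]


def load_early(raw_lines):
    """Early binding: parse and type every field once, at load time."""
    rows = []
    for line in raw_lines:
        rec = json.loads(line)
        rows.append({
            "user": str(rec["user"]),
            "amount": float(rec["amount"]),
            "day": date.fromisoformat(rec["day"]),
        })
    return rows


def total_late(raw_lines):
    """Late binding: keep the raw text, interpret fields only when queried."""
    return sum(float(json.loads(line)["amount"]) for line in raw_lines)


early_total = sum(row["amount"] for row in load_early(RAW_EVENTS))
late_total = total_late(RAW_EVENTS)
```

Both paths give the same answer today. The difference is where a bad or changed value surfaces: once at load time with early binding, or inside every query with late binding.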

These databases or data stores often enable rapid exploration via schema-free queries. And, it’s true that rapid exploration is a key piece of any agile company’s foundation, just as it’s true that some corners of the technology world are evolving so quickly that having to slow down and put governance and forethought into data storage and structure can be the difference between success and failure.

But with schema-free queries, it also pays to be prudent. If you’re not careful, they can make your data dishonest.

The fact that data isn’t wrapped in governance is fine (and preferred) for just poking around. We opt for schema-free queries in the first place because a lot is changing around us and new data sources are emerging regularly. Schema-free is great for an initial prototype, but once we move past the prototype stage, the lack of schema quickly becomes a governance nightmare.

A Crumbling Analytics House Built on Schema-Free

Without governance, whatever you produce (whether it’s a dashboard or some metric read-out) could begin lying to you. This is the exact problem we faced in the mid-2000s during my tenure at eBay, when an entire experimentation platform, with hundreds of experiments built on late binding, started to fold like a house of cards. The reason was that the incoming data started changing on us, and there were no controls or governance in place to catch the change.

It only takes one developer upstream going about his day-to-day work to change the meaning of a tag, thinking he is the only one using it. Once that happens, everything built with that data could produce slightly to completely different results. Plus, there is no lineage with schema-free queries, so you won’t even know that anything has been changed!
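Here is a toy Python illustration of how quietly this can happen. The "price" tag and the switch from dollars to cents are assumptions made up for this example; they are not a description of the actual eBay incident.

```python
def average_price(events):
    """Late-bound metric: trusts whatever the 'price' tag currently means."""
    prices = [e["price"] for e in events if "price" in e]
    return sum(prices) / len(prices)


# Hypothetical upstream change: the same tag, first in dollars, then in cents.
events_before = [{"price": 20.0}, {"price": 30.0}]
events_after = [{"price": 2000}, {"price": 3000}]

avg_before = average_price(events_before)
avg_after = average_price(events_after)
```

Nothing fails and nothing warns. The query runs exactly as before, and the dashboard simply reports a number that is off by a factor of 100.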

Put simply, schema-free queries can quickly become a foundation for a house that crumbles after it’s built.

Don’t get me wrong: Late binding is a must-have capability in today’s data infrastructure. We have long been working on getting more and more late binding features into our various products, with the latest example being high-performance binary JSON storage and processing natively within the Teradata database.

Building Trust in Your Data

While systems need to support both late and early binding, tight and loose coupling, the evolution towards schema (even if only for subsets of data) is a must-have step for any data product development process.

Schema is not just a nuisance. It’s not there to be painful; it’s there to control structure and actually reject mismatches along the way. It forces a different way of thinking about production quality than a free-flowing, unstructured lake that changes by the minute and is hard to rely on for repeatability. Trust in repeatable and consistent results is key to the success of Big Data.
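As a rough sketch of what "rejecting mismatches" means in practice, the Python below checks each record against a declared schema before it is accepted. The schema, its field names and the validate helper are hypothetical; a real database enforces this with its own type system and constraints.

```python
# Hypothetical schema for an experiment record: field name -> required type.
SCHEMA = {"user_id": str, "experiment": str, "converted": bool}


class SchemaMismatch(ValueError):
    """Raised when a record does not match the declared schema."""


def validate(record, schema=SCHEMA):
    """Return the record unchanged if it matches the schema, else raise."""
    missing = schema.keys() - record.keys()
    unexpected = record.keys() - schema.keys()
    if missing or unexpected:
        raise SchemaMismatch(
            f"missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise SchemaMismatch(
                f"{field}: expected {expected.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return record


good = {"user_id": "u1", "experiment": "exp42", "converted": True}
bad = {"user_id": "u1", "experiment": "exp42", "converted": "yes"}

validate(good)  # accepted
try:
    validate(bad)  # rejected: 'converted' arrived as a string
except SchemaMismatch as err:
    print(f"rejected: {err}")
```

The point is not the specific checks but the moment they run: the mismatch is caught when the data arrives, not months later when a report looks wrong.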

The lesson is that you need to constantly check whether schema-less data is being used for production purposes. Similarly, the moment you find something in your data exploration, figure out which tags you’re using for the production-like environment, and make sure you have the ability to check on them. While there’s often value in getting to data quickly to uncover new things, there is also value in knowing that a particular tag has a certain meaning, especially once you make the move from exploration to production.
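One lightweight way to "check on" the tags you depend on is to profile them once, keep that profile as a baseline, and compare every new batch against it. The Python below is a minimal sketch of that idea; the profile and drift_report helpers and the sample tags are hypothetical.

```python
def profile(records):
    """Map each tag to the set of Python type names observed for it."""
    seen = {}
    for rec in records:
        for tag, value in rec.items():
            seen.setdefault(tag, set()).add(type(value).__name__)
    return seen


def drift_report(baseline, current, watched_tags):
    """List watched tags that vanished or changed type since the baseline."""
    issues = []
    for tag in sorted(watched_tags):
        if tag not in current:
            issues.append(f"{tag}: missing")
        elif current[tag] != baseline.get(tag):
            old = sorted(baseline.get(tag, set()))
            new = sorted(current[tag])
            issues.append(f"{tag}: types {old} -> {new}")
    return issues


# Hypothetical data: the baseline from exploration, then a later batch.
baseline = profile([{"price": 20.0, "sku": "a"}])
current = profile([{"price": "20.00"}])
issues = drift_report(baseline, current, {"price", "sku"})
```

A check like this only catches structural drift, such as a tag disappearing or changing type. A change in meaning with the same type (dollars becoming cents, for instance) still needs value-range checks or an agreed definition of the tag, which is exactly where schema and governance come in.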

As part of the series of articles on the concept of the Sentient Enterprise, I have talked about the need for the Layered Data Architecture, a data classification framework that allows for the rapid and agile integration of unstructured or late binding data. The key to success is to properly classify all your incoming data as it is being accessed, used and relied on, and to elevate data elements from non-coupled to loosely coupled to tightly coupled status.

When we build algorithms, models, or reports (any form of repeatable usage of data), we are obligated to have control and authority over the data behind them, so we can make sure they will continue to do what they claim to do.

Oliver Ratzesberger is President and Chief Executive Officer of Teradata Corporation. Until January 2019, he served as Teradata’s Chief Operating Officer with global operating responsibility for Teradata’s operations and led the company’s strategies for go-to-market, products, and services. He joined Teradata’s Board of Directors in November 2018.

Mr. Ratzesberger has an extensive background in analytics, big data, and software development. Prior to Teradata, he worked for both Fortune 500 and early-stage companies, holding positions of increasing responsibility in software development and IT, including leading the expansion of analytics at eBay.

A pragmatic visionary, Oliver frequently speaks and writes about leveraging data and analytics to improve business outcomes. His book with co-author Mohanbir Sawhney, “The Sentient Enterprise: The Evolution of Decision Making,” was published in 2017 and was named to the Wall Street Journal Best Seller List.

Oliver is a graduate of Harvard Business School’s Advanced Management Program and earned his engineering degree in Electronics and Telecommunications from HTL Steyr in Austria. He lives in San Diego with his wife and two daughters.