Can Failed Data Lakes Succeed As Data Marketplaces?

All over the world, data lake projects are foundering, not because they are not a step in the right direction, but because they are essentially uncompleted experiments. How can these data lake projects be saved?

Short answer: Data lakes can only be saved if they are reconceived. For a data lake to succeed, the sponsors must assert a new, complete vision. As part of a Research Mission I am leading at Early Adopter Research, we are looking into why data lakes have failed and what new visions may lead to success.

In this installment, we are going to examine the vision that Podium Data has put forth for a data marketplace and the data lake failure modes the company has discovered in working with dozens of clients.

How We Got Here

The original energy behind the data lake sprang up because Hadoop showed that it was possible to process and store big data in nearly limitless amounts in an affordable way, using new computing paradigms. The general idea was that companies could:

Affordably store the growing amount of big data becoming available from machine data sources (logs and other digital exhaust) and all sorts of sensors (IoT).

Imitate the successes of the web-scale Internet companies by using new computing paradigms (MapReduce) to extract valuable signals.

Use new methods of data management (such as storing flat files, JSON, Parquet, and Avro) and new modes of data processing (schema on read, as illustrated in the short sketch below) to expand the reach of the data lake to new types of data.

Later in the development of Hadoop and the data lake concept, another goal was added:

Use data lakes as a cheaper, more flexible alternative for certain ETL and batch workloads running on expensive data warehouses.
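To make the schema-on-read idea concrete, here is a minimal sketch using PySpark. The bucket paths and the event_type and page_id fields are hypothetical; the point is only that the data's structure is discovered when it is read, rather than declared in a schema up front as a warehouse would require.

```python
# Minimal schema-on-read sketch (assumes a Spark installation and a
# hypothetical set of raw clickstream files in object storage).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# No schema is declared up front: the structure is inferred when the raw
# JSON is read, which is what "schema on read" means in practice.
events = spark.read.json("s3a://raw-zone/clickstream/*.json")

events.printSchema()  # inspect whatever structure the files happen to have

# The same data can be filtered and re-persisted in a columnar format such
# as Parquet without ever defining warehouse tables ahead of time.
(events.filter(events.event_type == "page_view")
       .groupBy("page_id")
       .count()
       .write.mode("overwrite")
       .parquet("s3a://refined-zone/page_view_counts/"))
```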

The problem is that these general concepts about data lakes never translated into mission-critical computing infrastructure that radiates value the way data warehouses, despite their flaws, do.

The main roadblock has been that once companies store their data in the data lake, they struggle to find a way to operationalize it. The data lake has never become a product the way a data warehouse has. Proofs of concept are tweaked to keep a desultory flow of signals going.

This is a long way from what happens in a typical data warehouse, where extensive data engineering produces “gold standard” data, suitable for widespread operational use, and SQL provides the query mechanism that operationalizes business data. Data engineering enabled companies to generate business-ready data that is sufficiently organized and documented for thousands of business users to access using a variety of analytic tools.

With the data lake, companies can store massive amounts and varieties of data, but they have been unable to effectively manage that data and allow a large number of people with moderate expertise to explore it, come up with useful queries, and extract the signal through a regular production process that becomes part of the way the business runs.

Some companies, like Netflix, have managed to operationalize a data lake using the Netflix Genie software, now open sourced and in its third version, which helps run and optimize batch jobs at scale. And it must also be noted that for certain use cases, Hadoop and purpose-built data lake-like infrastructure are solving complex and high-value problems. But in most other businesses, the data lake got stuck at the proof of concept stage. That is why in general, the data lake is now in need of salvation. The point of saving the data lake is to understand how we go from having a repository of data with signals to operationalizing that information to provide value to the business.

Saving the Data Lake

The good news is that lots of smart people are working on how to take the good ideas that got the data lake started and turn them into something genuinely excellent.

So far we’ve found that each company and vendor we examine, including data warehouse vendors, has a different vision for the future of the data lake. Saving the data lake has become more urgent in an era of data science and AI because both domains require a well-maintained data supply chain. That means not just amassing data, but having a way to process and manage it as well.

Recently, I talked to Paul Barth, CEO of Podium Data. He provided his company’s analysis of why the data lake has failed to this point and its vision for a data marketplace that will achieve the goals of the data lake.

Barth pointed to three main failure modes that have led to problems with the data lake.

The Three Failure Modes of Data Lakes

According to Barth, there are three main failure modes of data lakes:

Polluted data lakes. A polluted data lake occurs when many pilot projects, each with its own tools, are unleashed. This leads to lots of experimentation, but nothing that can scale to production. A plethora of tools and projects were focused on the data lake, but the result was mounds of data without an organizing structure, metadata, or quality control processes. It became difficult to find data or to understand its utility or purpose.

Bottlenecked data lakes. Other companies struggled because they treated data lakes as next-generation data warehouse technology rather than as an entirely new approach to data. The data lake was carefully curated, with engineers involved in every decision. “That leads to a bottleneck,” Barth said. While such a team can produce results, it cannot do so at scale, and the process is as slow as ever. Because many common and open source data management tools do not meet enterprise demands with regard to load management and data quality checks at ingest, companies recreated a worse data warehouse experience as they moved to the data lake. “It becomes cheaper storage and processing supporting a poor quality overall system,” Barth said.

Risky data lakes. In an effort to give data scientists access to the data lake quickly, some companies have not ensured that policies are applied to sensitive data such as PII. Enterprises need standards and policies that enforce security around masking and managing sensitive data so that it is not exposed to the wrong internal users, or, even worse, to external ones. But the lack of enterprise-class data management tools for data lakes has led some companies to neglect the monitoring, auditing, and access control they should be performing. As a result, they can’t use the data lake at scale. “There’s a need to have data supply chain and data management principles on the foundation of enterprise data management and enterprise class services. If that link is missing you’re never really going to get the value out of data lakes,” Barth said.

The Data Marketplace: Podium Data’s Vision for Saving the Data Lake

Barth calls his vision for a data lake that works a data marketplace. Barth pointed out that one of the main underlying issues with data lakes is that companies are using them both incorrectly and inefficiently. The remedy Podium Data suggests is to combine the foundational capabilities of data warehouses and the analytics and ETL infrastructure that has grown up around them with the key features of a data lake to create a new structure called a data marketplace.

The data marketplace strategy is based on the following fundamental changes in the way data is organized and managed:

Create a transparent catalog and repository for all the data: The data marketplace has the plumbing to allow all data, including data outside the data lake, to be stored, evaluated, cataloged, searched, and accessed. Everything goes into the catalog regardless of where it sits in the enterprise, new datasets are absorbed seamlessly as they arrive, and those datasets can then be reused in a cost-effective manner.

Use market feedback to discover the most valuable datasets: With the data lake’s unlimited capacity, companies can link all their data to the marketplace and then learn, based on usage (referred to as market feedback), which data is most valuable and how it is being used, cleaned, and improved.

Differentiate gold, silver, and bronze datasets: By persisting data as it moves from raw to ready in its lifecycle and using crowd-sourced feedback from data consumers, companies can begin to identify and categorize the value of their datasets. Categorizing data in this way ensures that all data has some utility, whatever its ranking, because there is now organization where before there was just a data swamp. (A simplified sketch of such a tiered, usage-aware catalog record appears after this list.)

Invest in developing the most valuable data: The data that turns out to be the most useful can then receive the attention it deserves and be cleaned, modeled, integrated, and turned into an easy-to-use product.

Enable self-service in as many ways as possible: The data marketplace itself must allow easy exploration of data, use of data inside the marketplace, and integration with existing reporting and analytics systems. Allowing more people to use data directly means gathering more information about usage, which in turn makes the marketplace more powerful.
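To ground these ideas, here is a simplified sketch of the kind of catalog record and feedback loop described above. The field names, tier thresholds, and promotion rules are illustrative assumptions, not Podium Data’s actual design.

```python
# A simplified sketch of a marketplace catalog record with usage-based
# "market feedback" and gold/silver/bronze tiering. All names and
# thresholds are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    BRONZE = "bronze"   # raw, lightly documented
    SILVER = "silver"   # profiled and partially cleaned
    GOLD = "gold"       # curated, business-ready


@dataclass
class CatalogEntry:
    name: str
    location: str                  # where the data physically lives
    owner: str
    tier: Tier = Tier.BRONZE
    access_count: int = 0          # usage signal ("market feedback")
    consumer_rating: float = 0.0   # crowd-sourced feedback from consumers

    def record_access(self) -> None:
        """Count each read so popular datasets surface over time."""
        self.access_count += 1


def promote_if_valuable(entry: CatalogEntry) -> CatalogEntry:
    """Promote heavily used, well-rated datasets to the next tier so that
    engineering effort goes where the marketplace shows demand."""
    if entry.tier is Tier.BRONZE and entry.access_count > 100:
        entry.tier = Tier.SILVER
    elif entry.tier is Tier.SILVER and entry.consumer_rating >= 4.0:
        entry.tier = Tier.GOLD
    return entry
```

In a real marketplace the access counts would come from query logs and the ratings from data consumers, but the feedback loop is the same: usage reveals which datasets deserve further investment in cleaning and documentation.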

But what I want to point out about this vision is that it is not technology-dependent. We already have the technological capabilities to meet this goal. What has been lacking is a vision of how to craft a data architecture that is right for your company.

Podium Data believes that a data marketplace is a crucial solution to allow IT to organize the data infrastructure and ensure that IT remains relevant to the business. Without a stronger vision for the way companies should approach their data, IT itself is at risk.

So, as the challenge now is not one of technology but of setting a vision, companies have to decide how to incorporate a new set of requirements to get the most out of their data. Podium Data's view is that those requirements will be highly individualized and dependent on the business.

Even within one company, different groups may have different requirements. Marketing may not need the precision that the accounting department requires. Groups with regulatory mandates may have strong compliance requirements that demand data that is 100% accurate, while those doing exploration for product development may prefer larger datasets and find 90% accuracy sufficient. The data lake must be able to employ multiple approaches as needed by different applications and groups of users.

In each of these realms, companies need to assess good, better, and best options for their particular needs. That's the key to resurrecting the data lake: meeting the data requirements that deliver on your business demands with appropriate technology. To achieve this, the technology is far less of a problem than crafting an integrated vision and charting a course forward.

The Power of a Data Marketplace

Data lakes as they were originally conceived are worth saving, but the real victories will come from creating a new vision like the data marketplace. Barth asserts that data marketplaces empower companies along a number of dimensions, enabling them to:

Optimize the data supply chain. Applying the marketplace approach Barth discussed, companies can see which high-value activities are associated with their data over time. This can then change how they manage, store, improve, and use their data. “Each investment of effort in improving the quality or usability or value of data in that supply chain is something that you can build upon and not throw away,” Barth said. “That’s where a data warehousing technology approach is massive, one size fits all and not the right approach for the data lake. If we’re doing it incrementally, we want to do it in a shared, collaborative environment and make the product of our work available to others.”

Know what’s there. Even though the quantity of data that data lakes can store is staggering, when you organize them in the right way, using a marketplace strategy, you can finally determine what data you have and what data you’re missing. That’s vital because you can gain intelligence both from a variety of data sources and from the underlying metadata.

Automate metadata collection. With automatically collected marketplace metadata, you can build a repository that allows you to productize your data in new ways. You can rapidly build a catalog that explains an enormous amount about each dataset and allows it to be used as a foundation for building larger datasets. Knowledge of each dataset, its usage, and its potential to be a building block should drive your strategy. (A short sketch of this kind of automated profiling appears after this list.) “Companies need to stop looking at the individual pieces of glass in the stained glass window and think about the whole window,” said Barth. “When you have a marketplace, you get a bigger picture based on the metadata that’s extremely helpful.”

Drive agile development. For agile data management, Barth pointed out that it’s not about perfection but about priorities. With a marketplace-based catalog of your data, you can identify opportunities and challenges and start to prioritize what to organize and clean up, making the data lake more and more usable.

Make the data lake the go-to place. Once the data lake is organized and more transparent, replete with valuable metadata, it can become the “go-to” place for your business. You build a critical mass of data that then draws in additional data.
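As a rough illustration of the automated metadata collection mentioned above, the sketch below profiles a newly arrived file and emits a catalog-ready record. The file path and the choice of profile fields are assumptions for illustration; a vendor tool would collect far more.

```python
# Hedged sketch of automated metadata collection: profile a dataset as it
# lands and emit a catalog-ready record.
import json
import pandas as pd


def profile_dataset(path: str) -> dict:
    """Read a delimited file and derive basic, automatically collectable
    metadata: schema, volume, completeness, and cardinality."""
    df = pd.read_csv(path)
    return {
        "source": path,
        "row_count": int(len(df)),
        "columns": [
            {
                "name": col,
                "dtype": str(df[col].dtype),
                "null_rate": round(float(df[col].isna().mean()), 4),
                "distinct_values": int(df[col].nunique()),
            }
            for col in df.columns
        ],
    }


if __name__ == "__main__":
    # Hypothetical newly arrived file; the profile could be stored next to
    # the data or pushed into the marketplace catalog.
    profile = profile_dataset("landing/customer_orders.csv")
    print(json.dumps(profile, indent=2))
```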

For Barth, to create a data marketplace, you need to follow these steps:

Rapidly build a catalog to assess what you have

Determine the priorities

Run an agile process

Deliver meaningful business results as early as possible

Craft a change management strategy for creating critical mass

Experts Are a Success Factor

One major challenge to implementing the data marketplace vision is the silos of expertise that have built up over time to handle the complexity of the many different functions in an enterprise-scale data warehouse. In my conversation with members of Podium Data’s leadership team, they noted that there are usually numerous experts. You have a person who understands data quality, and another who knows what you need on the reporting front.

But what’s lacking is stitching all that expertise together into a larger, cohesive vision, so that you’re not just a gerbil running on a wheel focused on optimizing one narrow function. It’s truly rare that companies have a vision that takes their in-house talent and combines it with the right technology to achieve both agility and transparency.

And you need both. For agility, the idea is that a new dataset can arrive and be understood, made available, and documented in such a way that people can use it widely on their own, without long conversations with the data’s steward. Transparency means users have a full set of information about the data that goes beyond data lineage and when the dump truck delivered the data to the business: it’s about knowing what the data has meant from its inception. Only once you have this type of insight into your data can you truly understand its quality.

Companies achieved data transparency with data warehouses through the use of canonical data models, yet that data was trapped in slow processes that lacked agility: it was well understood but couldn’t evolve at the speed of business. The data lake wasn’t able to correct this problem because companies didn’t implement lakes with a sufficiently comprehensive vision. That’s what they need to do now. They need to address how to present data to their employees and have their IT staffs create a surface area and portfolio of technology (which should include data lakes) that becomes a functional marketplace to meet their data needs.