Now That You’ve Gone Swimming in the Data Lake – Then What?

Summary: What happens after you make those critical discoveries in the Data Lake and need to make that new data and its insights operational?

image source: EMC

Data Lakes are a new paradigm in data storage and retrieval and are clearly here to stay. As a concept they are an inexpensive way to rapidly store and retrieve very large quantities of data that we think we want to save but aren’t yet sure what we want to do with. As a bonus they can hold unstructured or semi-structured data, streaming data, or very large quantities of data, covering all three “Vs” of Big Data. The great majority of these are Hadoop key-value DBs, which many technical reviewers report to have unstoppable momentum.

What brought these into existence is of course NoSQL technology, which arose originally to solve the pain points of not being able to store and retrieve the volume, variety, and velocity of Big Data. But that was just the start. Once the technology was here something else interesting happened. Data consumers (the analysts, data scientists, and line-of-business users) figured out that it could be used to solve another giant pain point: waiting for IT to deliver the data we wanted to analyze.

This isn’t to point a finger at IT (though they were a favorite target when only RDBMS was available). In the RDBMS-age data had to be ETL’d (cleansed and loaded) into data warehouses and to do that the structure of the tables had to be decided in advance. That meant knowing what questions needed to be answered before anyone even got to poke around in the data.

Remember those bad old days? My favorite story is about my big K-12 client for which I was doing employee and payroll analysis. This required an extract of about 1.2 million lines and 20 or 30 features. Using their brand new EDW it took 48 hours to run the query and I literally had to stand in line for 30 days before the time became available. Ugh.

And after Data Lakes, here’s what we got.

Self service data access. Users could now extract their own data and examine it to their heart’s content, especially now that we have SQL-on-Hadoop tools. No more waiting for IT to load and extract our data which could add weeks or months to time-to-insight.
Really inexpensive storage. In general it appears that storage on Hadoop is running $1,500 to $3,500 per Terabyte (HW+OSS), compared to about $35,000 per Terabyte (on an appliance) for EDW.
Text, image, voice, click stream or any other type of data we cared to capture. Hadoop doesn’t care. A lot of this is time series data like clickstreams, web logs, sensors, social media, and web/NoSQL mobile-facing operational data.
Schema on read: We don’t need to know what questions will be asked in advance. This is a true data sandbox for exploration and discovery.

So the takeaway that many DB developers would have you believe is ‘Hadoop Good’, ‘RDBMS Bad’.

But wait. RDBMS EDW hasn’t gone away and won’t. That’s where we keep our single version of the truth, the business data that record legal transactions with customers, suppliers, and employees. We also get strong SLAs, strong fault tolerance, and highly curated data based on strong ETL, provenance, and governance. Those are all things that are missing in our Data Lake.

So back to the title of this article. After we’re done exploring the data in the data lakes and have made significant discoveries we want to ensure those new insights can be operationalized. During the discovery process we increased our understanding of the data, added structure, and came to understand how to determine and fix data quality issues. In short, we added knowledge of how to productionize the data. Basically that means formalizing our relationship with the required data including strong ETL, provenance, governance, and for sure we want those SLAs.

One option would be to go to IT and have them add the new data to the core EDW. But with a cost difference in the range of 10X to 30X just for storage, plus management and administration, IT is open to any suggestion that would offload these new requirements.
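For a rough sense of that gap, here is a back-of-the-envelope sketch using only the per-terabyte figures quoted earlier in this article. The 100 TB volume and the function names are hypothetical, chosen for illustration. On these storage figures alone the multiple works out to roughly 10X to 23X, so the upper end of the 10X to 30X range presumably rests on costs not itemized here.

```python
# Per-terabyte storage figures quoted earlier in this article (USD).
HADOOP_LOW_PER_TB = 1_500      # Hadoop, low end (HW + OSS)
HADOOP_HIGH_PER_TB = 3_500     # Hadoop, high end (HW + OSS)
EDW_APPLIANCE_PER_TB = 35_000  # EDW on an appliance

# Hypothetical volume of newly discovered data to be made operational.
NEW_DATA_TB = 100


def cost_multiple(edw_per_tb, hadoop_per_tb):
    """How many times more expensive EDW storage is than Hadoop storage."""
    return edw_per_tb / hadoop_per_tb


def storage_cost(terabytes, cost_per_tb):
    """Total storage cost for a given volume."""
    return terabytes * cost_per_tb


# The smallest multiple pairs EDW with the most expensive Hadoop figure,
# the largest multiple pairs it with the cheapest.
low_multiple = cost_multiple(EDW_APPLIANCE_PER_TB, HADOOP_HIGH_PER_TB)
high_multiple = cost_multiple(EDW_APPLIANCE_PER_TB, HADOOP_LOW_PER_TB)

print(f"Storage multiple: {low_multiple:.1f}X to {high_multiple:.1f}X")
print(f"EDW cost for {NEW_DATA_TB} TB: "
      f"${storage_cost(NEW_DATA_TB, EDW_APPLIANCE_PER_TB):,}")
print(f"Hadoop cost for {NEW_DATA_TB} TB: "
      f"${storage_cost(NEW_DATA_TB, HADOOP_LOW_PER_TB):,} to "
      f"${storage_cost(NEW_DATA_TB, HADOOP_HIGH_PER_TB):,}")
```

These numbers cover storage only; management and administration would sit on top of them.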

Adjunct Data Warehouse: I recently came across this concept in a Gigaom Research Report by George Gilbert suggesting that an intermediate step between Data Lakes and EDW would be the solution of choice.

Using the Adjunct Data Warehouse provides for production ETL, reporting, BI, reasonably good SLAs, and reasonably good governance. Most importantly the underlying technology is SQL-on-Hadoop so costs are low (avoiding the CapEx of expanding the EDW).

Here’s an interesting chart from the Gigaom report contrasting the characteristics of the Data Lake, the Adjunct Data Warehouse, and the Enterprise Data Warehouse. The goal is to keep the new data off the EDW, at least until it is fully integrated into operations, to keep costs low, but to add back the structure, reliability, and trust we need to accept the data in an operational scenario.

Much of that involves business processes and human supervision but there is one more technological step that needs to be considered. What the Gigaom report points to is using ‘SQL-on-Hadoop’ DBs (note that we’ve pretty much stopped calling them NoSQL). But these now come in two distinct flavors and if you want to pursue this idea you would have to pick one.

The first is traditional key-value SQL-on-Hadoop, which is schema on read (or late schema). However, matching this for cost and scalability are the NewSQL SQL-on-Hadoop solutions that Gartner recently started calling ‘Avant Garde’ or ‘Emerging’ RDBMS. These are fully ACID and have a horizontally scalable MPP architecture, but are in fact RDBMS, meaning they are schema on write.

So with nearly equal costs and capabilities you could consider either a schema-on-read or a schema-on-write solution. Both would lay the formal groundwork needed for operational data while retaining the accelerated time-to-insight offered by Data Lakes and keeping costs low.
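To make the distinction concrete, here is a minimal sketch of the two approaches in plain Python. It is an illustration only, not how any particular SQL-on-Hadoop or NewSQL product works: the sample events, the page_views table, and both function names are hypothetical, and an in-memory SQLite database stands in for a schema-on-write RDBMS.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a data lake:
# one JSON document per line, with fields that vary record to record.
RAW_EVENTS = [
    '{"user_id": "a1", "page": "/home", "ms": 120}',
    '{"user_id": "b2", "page": "/cart", "ms": 340, "coupon": "SPRING"}',
    '{"user_id": "a1", "device": "mobile"}',
]


def read_with_schema(raw_lines, fields):
    """Schema on read: the structure is chosen at query time.

    Every raw record is kept as-is; fields a record lacks come back as None.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append(tuple(record.get(field) for field in fields))
    return rows


def load_with_schema(raw_lines):
    """Schema on write: the table is defined before any data is loaded.

    Records that do not satisfy the declared columns are rejected at load.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_views ("
        "user_id TEXT NOT NULL, page TEXT NOT NULL, ms INTEGER NOT NULL)"
    )
    rejected = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            conn.execute(
                "INSERT INTO page_views (user_id, page, ms) VALUES (?, ?, ?)",
                (record.get("user_id"), record.get("page"), record.get("ms")),
            )
        except sqlite3.IntegrityError:
            # A NOT NULL column was missing, so the record does not fit.
            rejected.append(record)
    conn.commit()
    return conn, rejected


# Schema on read: ask a question nobody planned for (who used a coupon?).
print(read_with_schema(RAW_EVENTS, ["user_id", "coupon"]))

# Schema on write: only records matching the predefined table are loaded.
conn, rejected = load_with_schema(RAW_EVENTS)
loaded = conn.execute("SELECT COUNT(*) FROM page_views").fetchone()[0]
print(f"loaded={loaded}, rejected={len(rejected)}")
```

The schema-on-read path answers an unplanned question without any reloading, while the schema-on-write path enforces structure up front and surfaces the record that does not conform, which is the kind of quality gate operational data tends to need.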

About the author: Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. Bill is also Editorial Director for Data Science Central. He can be reached at: