James Dixon Imagines A Data Lake That Matters

While much has been said and written about the data lake, in the end, the concept has stayed pretty simple. The idea has been to use Hadoop as a place where data of all types can be stored in greater detail than ever before, at an affordable cost, and then used to power the existing data warehouse ecosystem and to perform new types of analytics.

This week, James Dixon, the CTO of Pentaho and the creator of the term “data lake”, presents a challenge to the big data community in his latest blog “Union of the State - A Data Lake Use Case”. Dixon argues that it is time to start figuring out how to make the data lake a time machine for a business.

Here is his idea in a nutshell:

Most of the applications in use capture the state of the enterprise in some way.

A fraction of the changes to the state are captured via change logs.

Dixon suggests storing all enterprise data in a data lake and effectively creating change logs for every field.

In addition, Dixon suggests capturing behavioral data on how applications of all types were used.

In this way the data lake becomes a time machine that allows the state of the enterprise at any one moment to be captured and analyzed.

The power to see what happened before and after important events will lead to even better predictive models and a deeper understanding of operations.

What I like about Dixon’s vision is that he gives the data lake an ambitious mission. But the how is important. In his blog Dixon addresses a few implementation details. I want to go further here and add some suggestions for how to make the Union of the State vision a reality.

Building a Union of the State Machine

Dixon’s vision is that by using change logs, database logs, or snapshots of the state of a database, you will be able to create a time machine for enterprise data of sufficient granularity to matter.
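To make the time machine idea concrete, here is a minimal sketch of a field-level change log and a function that replays it to rebuild state at a chosen moment. It is my own illustration, not code from Dixon's post. The `Change` record, the `state_at` function, the integer timestamps, and the sample customer data are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Change:
    """One entry in a hypothetical field-level change log."""
    ts: int       # when the change happened (any increasing time unit)
    entity: str   # which record changed, e.g. "customer:42"
    field: str    # which field changed
    value: Any    # the new value of that field


def state_at(log: List[Change], ts: int) -> Dict[str, Dict[str, Any]]:
    """Rebuild the state of every entity as of time ts (inclusive)."""
    state: Dict[str, Dict[str, Any]] = {}
    # Replay changes in time order; later changes overwrite earlier ones.
    for change in sorted(log, key=lambda c: c.ts):
        if change.ts > ts:
            break
        state.setdefault(change.entity, {})[change.field] = change.value
    return state


# Made-up sample data for illustration only.
log = [
    Change(1, "customer:42", "status", "trial"),
    Change(1, "customer:42", "region", "EMEA"),
    Change(5, "customer:42", "status", "paid"),
    Change(9, "customer:42", "region", "APAC"),
]

print(state_at(log, 4))
print(state_at(log, 9))
```

Because the log is append-only, nothing is ever overwritten, so any past moment can be reconstructed by replaying the entries up to that point.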

When I started thinking about Dixon’s vision, it occurred to me that there must be a way to build the Union of the State machine, one that has full history, using existing technology. Instead of a naive implementation in which you derive change logs from snapshots using direct comparisons, why not use databases that already have timestamps embedded in them? Teradata’s Integrated Data Warehouse has such a capability. So does Datomic, a so-called immutable database that preserves all state ever entered. Using one of these two approaches could make building a Union of the State machine much easier.
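For comparison, the naive approach of deriving change logs from snapshots might look something like the sketch below. The snapshot layout (a dictionary of records keyed by primary key), the `diff_snapshots` function, and the order data are hypothetical and chosen only to keep the example small.

```python
from typing import Any, Dict, List, Tuple

Snapshot = Dict[str, Dict[str, Any]]   # primary key -> {field: value}
ChangeRow = Tuple[str, str, Any, Any]  # (key, field, old value, new value)


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[ChangeRow]:
    """Compare two snapshots field by field and list every difference.

    A record or field that is absent from one snapshot is reported with
    None on that side, so this sketch cannot tell "missing" from "null".
    """
    changes: List[ChangeRow] = []
    for key in sorted(old.keys() | new.keys()):
        before = old.get(key, {})
        after = new.get(key, {})
        for field in sorted(before.keys() | after.keys()):
            if before.get(field) != after.get(field):
                changes.append((key, field, before.get(field), after.get(field)))
    return changes


# Made-up snapshots of an orders table on two consecutive days.
monday = {
    "order:1": {"status": "open", "total": 100},
    "order:2": {"status": "open", "total": 40},
}
tuesday = {
    "order:1": {"status": "shipped", "total": 100},
    "order:3": {"status": "open", "total": 15},
}

for row in diff_snapshots(monday, tuesday):
    print(row)
```

The comparison has to touch every record and field in both snapshots, and any change that happened and was reversed between two snapshots is invisible. Both limits are reasons to prefer a store that records history as data is written.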

The second way these technologies would help would be in running queries. Both Teradata and Datomic have query languages that allow you to set the point in time for the state you want to consider and discover. I’d be interested in finding out about other repositories that have the same ability to handle a timestamp. Splunk could do the timestamp just fine, for example, but I suspect it would have a harder time with the aggregations, although I’m sure a Splunk master could manage it.
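To show the general shape of a point-in-time query with an aggregation, here is a small sketch using Python's built-in `sqlite3` module. It does not use Teradata's temporal syntax or Datomic's API. The `account_history` table, its `valid_from` and `valid_to` columns, and the balances are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account_history (
        account_id INTEGER,
        balance    INTEGER,
        valid_from INTEGER,  -- first moment this row was true (inclusive)
        valid_to   INTEGER   -- moment it stopped being true (exclusive); NULL = current
    )
""")
# Made-up history: each row is one version of an account's balance.
conn.executemany(
    "INSERT INTO account_history VALUES (?, ?, ?, ?)",
    [
        (1, 100, 0, 10),
        (1, 250, 10, None),
        (2, 500, 5, 20),
        (2, 300, 20, None),
    ],
)


def total_balance_as_of(ts: int) -> int:
    """Sum the balances of all accounts as they stood at time ts."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(balance), 0)
        FROM account_history
        WHERE valid_from <= ?
          AND (valid_to IS NULL OR valid_to > ?)
        """,
        (ts, ts),
    ).fetchone()
    return row[0]


for moment in (3, 7, 15, 25):
    print(moment, total_balance_as_of(moment))
```

The same filter on the validity interval can sit underneath any aggregation, which is the part I suspect is harder in tools built mainly for searching time-stamped events.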

The next challenge would be creating predictive models that look at the way the data changes over time. Nutonian could play a hugely positive role here by automatically creating models. Statistical modeling tools such as R and its equivalents could also play a role.