Part 3 (this post) is a brief speculation as to Kudu’s eventual market significance.

Combined with Impala, Kudu is (among other things) an attempt to build a no-apologies analytic DBMS (DataBase Management System) into Hadoop. My reactions to that start:

It’s plausible; just not soon. What I mean by that is:

Success will, at best, be years away. Please keep that in mind as you read this otherwise optimistic post.

Nothing jumps out at me to say “This will never work!”

Unlike when it introduced Impala — or when I used to argue with Jeff Hammerbacher pre-Impala — this time Cloudera seems to have reasonable expectations as to how hard the project is.

There’s huge opportunity if it works.

The analytic RDBMS vendors are beatable. Teradata has a great track record of keeping its product state-of-the-art, but it likes high prices. Most other strong analytic RDBMS products were sold to (or originated by) behemoth companies that seem confused about how to proceed.

RDBMS-first analytic platforms didn’t do as well as I hoped. That leaves a big gap for Hadoop.

I’ll expand on that last point. Analytics is no longer just about fast queries on raw or simply-aggregated data. Data transformation is getting ever more complex; that’s true in general, and it’s specifically true of transformations that need to happen in human real time. Predictive models now often get rescored on every click. Sometimes they even get retrained at short intervals. And while data reduction in the sense of “event extraction from high-volume streams” isn’t that big a deal yet in commercial apps featuring machine-generated data, if growth trends continue as many of us expect, it’s only a matter of time before that changes.

Of course, this is all a bullish argument for Spark (or Flink, if I’m wrong to dismiss its chances as a Spark competitor). But it also all requires strong low-latency analytic data underpinnings, and I suspect that several kinds of data subsystem will prosper. I expect Kudu-supported Hadoop/Spark to be a strong contender for that role, along with the best of the old-school analytic RDBMS, Tachyon-supported Spark, one or more contenders from the Hana/MemSQL crowd (i.e., memory-centric RDBMS that purport to be good at analytics and transactions alike), and of course also whatever Cloudera’s strongest competitor(s) choose to back.

Comments


Adam F on
September 30th, 2015 1:42 pm

What’s your take on the relative merits of Parquet-in-HDFS versus Kudu? (I know, apples and oranges – but it still seems like a real decision that application architects will face.) It seems like the key additional capability of Kudu versus Parquet is the ability to update existing records rather than just append. But I wonder about the relative value of this compared with the cost of introducing a whole new storage system into the already complex Hadoop environment, if the main goal is analytics. After all, lots of stuff can already talk to HDFS and Parquet, and Parquet doesn’t require any additional running services, just some client libraries.

And it seems like analytics has long been orienting more and more toward appends rather than updates – for example, log-based systems like Kafka that model everything as appends/messages, but also traditional dimensional data warehouse systems, where maintaining accurate history means implementing modeling approaches like type 2 slowly changing dimensions, so you only add to history instead of overwriting it. Given all this gravity toward append-based data, I can’t help but wonder whether a lighter-weight, library-based columnar storage framework like Parquet might have better Darwinian odds than yet another active service added to the Hadoop stack. Unless perhaps real OLTP support comes, and the argument becomes “all in one / OLTP + analytics.”
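For readers unfamiliar with the type 2 slowly changing dimension pattern mentioned above, here is a minimal, purely illustrative sketch (the field names and helper function are hypothetical, not any real warehouse API): instead of overwriting a customer record in place, you close out the current row and append a new version, so history accumulates through appends.

```python
from datetime import date

# Hypothetical type 2 SCD table: each version of a customer record is a
# separate row, with valid_from/valid_to marking when it was current.
dimension = [
    {"customer_id": 1, "address": "12 Oak St",
     "valid_from": date(2014, 1, 1), "valid_to": None},
]

def scd2_update(rows, customer_id, new_address, as_of):
    """Append-only change: expire the current row, then add the new version."""
    for row in rows:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            row["valid_to"] = as_of  # close out the previously current version
    rows.append({"customer_id": customer_id, "address": new_address,
                 "valid_from": as_of, "valid_to": None})

scd2_update(dimension, 1, "99 Elm Ave", date(2015, 6, 1))
# The table now holds both versions; the current one has valid_to=None.
```

Note that even this append-oriented pattern still touches an existing row (to set `valid_to`), which is part of why doing it on purely append-only storage is awkward.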

1. You’re streaming data, but you’re not putting everything into the analytic store — just what you’ve identified as “anomalies” or “events”.

2. Data arrives out of time sequence.

3. You’re enhancing the streaming data with information from a tabular store (e.g. customer records). Those get updated from time to time.

4. Heck, you want to replicate your whole business transaction data store into the place where you stream your web logs.

#1 is more IoT. #3-4 are more internet marketing. #2 is both.

Patrick Angeles on
October 5th, 2015 3:29 pm

@Adam

Very valid question. In exchange for a marginally more complex deployment architecture (Kudu does add another component to the Hadoop zoo), I believe it will greatly simplify the data architecture.

Consider the Type 2 case, which you brought up. To get that working right in HDFS, you’d need to handle incremental ingest, merge the new data with the historical data, and compact files to maintain decent scan performance.

This is a common pattern that’s been implemented in a number of places, but it’s non-trivial and fragile, and the latency between receiving the data and that data showing up in queries is measured in minutes, not sub-second intervals. Plus, merge/compact involves shuffling tables and views around, so the query consistency guarantees aren’t great. (Mostly because the Hive metastore currently does not support versioned metadata.)
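The merge/compact step above can be sketched in a few lines. This is a simplified illustration of the general pattern, not real Hadoop code; the function and field names are hypothetical. New records land as small incremental files, and a periodic batch job merges them with history and rewrites the partition as compacted files, which is where the minutes of latency come from.

```python
# Illustrative merge/compact: newer versions of a key replace older ones,
# then the whole partition is rewritten in sorted order (a stand-in for
# rewriting compacted files in HDFS).
def merge_and_compact(historical, incremental):
    """Batch merge with last-write-wins semantics per key."""
    merged = {row["key"]: row for row in historical}  # start from history
    for row in incremental:                           # apply new arrivals
        merged[row["key"]] = row                      # newer version wins
    # In HDFS this step rewrites the partition as fresh files, so queries
    # only see the new data after the batch job completes.
    return sorted(merged.values(), key=lambda r: r["key"])

history = [{"key": 1, "value": "a"}, {"key": 2, "value": "b"}]
new_batch = [{"key": 2, "value": "b2"}, {"key": 3, "value": "c"}]
compacted = merge_and_compact(history, new_batch)
```

Kudu’s pitch is that it absorbs this whole dance: you just update or insert rows, and the storage engine handles versioning and compaction internally.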

Kudu takes care of all this. And again, given that many production clusters already run a combination of ZK, Hive, MR, Spark, HDFS, HBase, Oozie, etc., and that we have cluster management tools that make these services easy to manage, adding another service like Kudu is a small price to pay for a much simpler data architecture.