From the frontlines of the data storage vs in-memory war

The battle between big data vendors touting the virtues of traditional data warehousing and analytics and those pushing in-memory processing is heating up. But where does that leave everyone else, i.e., everyone trying to get their big data projects done? And where are the kernels of truth in the battling rhetoric coming from the vendors?

In order to illuminate the differences between the two approaches and the associated claims, let's take a look at one vendor from each category and compare the two.

I chose Teradata for this exercise because it is the one remaining pure play in data warehousing, and because its chief technology officer, Stephen Brobst, recently made statements at a Teradata analytics event that were recorded in a ZDNet post and are therefore easy for us all to see and study.

For the opposing side, I chose Kognitio, an in-memory platform provider. I interviewed the company's Chief Innovation Officer, Paul Groom, to get his response to Brobst's public statements.

Both Teradata and Kognitio, as well as all other players in this space, are invited to respond further. This is not a closed discussion but rather an ongoing sorting of fact from spiel, and of science from marketing speak.

That said, let's get on with the comparison of our two examples from opposing sides of the battle.

According to the ZDNet post, Brobst said that while in-memory processing offers obvious speed improvements, "anybody who talks about putting all the data in memory and big data in the same sentence, has no idea what they're talking about."

At first read he appears to give a nod to in-memory but then continues to dismiss it. Is his dismissal accurate? Technically, yes, but only by virtue of careful wording.

"A great sentence distorted nicely by one small word – all," explains Groom. "'All of the data' needs to reside in one or more persistent stores to guarantee its availability for future access and use."

That being the case, what data should be in-memory?

"In-memory is relevant for data that is under the microscope, data that is being analyzed in detail with many complex methods such that there is high-frequency of access and low latency has increasing value," explains Groom. "A traditional query may only touch a given piece of data once or twice—scan to find, include into aggregate. Complex analytics may touch a piece of data hundreds or thousands of times depending on the algorithm and number of times that the algorithm is run, e.g. trying iterations with different parameters to find optimal score."

Ah, you might say, that sounds like the two vendors are saying much the same thing. And, broadly speaking, they are. But as always, the devil is in the details; specifically, in data flow.

According to the ZDNet post, Brobst said that the Hadoop cluster may be slower than front-line disk or SSD but it "allows you to capture that data effectively. Once you find value in that data, you can promote it into your warehouse using ETL techniques. You have to do what makes sense based on the size of the data you're working on."

Database access is almost always the bottleneck in the extract, transform, load (ETL) process; the short acronym belies the clunkiness of the process. Because the relevant subsets often can't be identified up front, too much data has to be extracted and the subsets identified later. The time this takes varies, but it is rarely real-time friendly.
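To make that clunkiness concrete, here is a minimal, hypothetical sketch of a batch ETL pass in Python. The table and column names are invented for illustration, and SQLite stands in for both the source system and the warehouse; the point is that when the interesting subset can't be expressed in the source query, the extract pulls far more rows than the transform ultimately keeps:

```python
import sqlite3

# Hypothetical example: extract a whole table, then discover the
# relevant subset only during the transform step.

def etl(source: sqlite3.Connection, target: sqlite3.Connection) -> int:
    # Extract: we can't express "interesting" rows in the source query,
    # so we pull everything. This full scan is the usual bottleneck.
    rows = source.execute("SELECT id, region, amount FROM sales").fetchall()

    # Transform: the subset is identified only now, after extraction.
    subset = [(i, r, a) for (i, r, a) in rows if a > 1000 and r == "EMEA"]

    # Load: write only the qualifying rows into the warehouse table.
    target.executemany("INSERT INTO big_sales VALUES (?, ?, ?)", subset)
    target.commit()
    return len(subset)

# Demo with in-memory databases standing in for source and warehouse.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
src.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, "EMEA", 1500.0), (2, "APAC", 900.0), (3, "EMEA", 200.0)])
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE big_sales (id INTEGER, region TEXT, amount REAL)")
print(etl(src, dst))  # 1 row qualifies, though 3 were extracted
```

Three rows move over the wire so that one can be loaded; at warehouse scale, that ratio is where the time goes.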

Hence the need to speed up and smooth out data flow. That is the aim of best-of-breed in-memory platforms.

"Data can now flow. We have for the first time in computing history a balance in power/capability between storage tech, CPU tech and network tech," says Groom. "We can read data from a persistent store (storage) and bring the required data to memory via network – the persistent store runs the selection process (scan) and the in-memory platform provides the compute platform (CPUs) to grind the data. This avoids the situation where the traditional siloed database is overloaded with mixed workload of load, tune, query, analyze. Each platform can be optimized for storage (Hadoop) or computation (in-memory)."
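Groom's division of labor can be sketched abstractly: the storage tier runs the scan and streams only qualifying data across the network, while the in-memory tier holds that working set and touches it many times over. Here is a hypothetical Python sketch of that split; the function names and the parameter-sweep loop (standing in for "complex analytics") are my illustration, not either vendor's API:

```python
# Hypothetical sketch of the storage/compute split Groom describes.

def storage_scan(rows, predicate):
    """Storage tier: run the selection (scan) and yield only
    qualifying records, as if streaming them over the network."""
    for row in rows:
        if predicate(row):
            yield row

def in_memory_analyze(working_set, parameters):
    """Compute tier: hold the working set in memory and touch each
    value repeatedly, once per parameter trial, keeping the best score."""
    best = None
    for p in parameters:
        score = sum(value * p for value in working_set)  # touches every value again
        if best is None or score > best:
            best = score
    return best

source = [("EMEA", 1500.0), ("APAC", 900.0), ("EMEA", 200.0)]
# Scan once at the storage tier; materialize only the subset in memory.
working = [amount for (_, amount) in
           storage_scan(source, lambda r: r[0] == "EMEA")]
# The compute tier then iterates over it with different parameters.
print(in_memory_analyze(working, [0.5, 1.0, 2.0]))  # 3400.0
```

The scan reads each source row once, while the analysis touches every in-memory value once per parameter; that asymmetry is exactly why the hot subset belongs in memory and the full data set does not.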

There is chatter in the Hadoop community that the old storage model is outdated and no longer warranted, which bodes ill for players like Oracle and Teradata if such a shift in storage strategies does come to pass. Even so, there will always be a need for storage.

"Storage cost will be considerably cheaper without an Oracle or Teradata license but that support contract for the Hadoop cluster will not be free and few commercial businesses will run production systems depending on web forum support alone," says Groom.

Have something to add to this discussion? Please do so in the comments below or send me an email with your thoughts. - Pam