Big Data, Cheap Storage Bring In-Memory Analytics Into Spotlight

In-memory analytics, like virtualization and the cloud, is an old idea that's been given new life. In this case, the combination of big data, inexpensive commodity storage and parallel processing make it possible to analyze terabytes of data without slowing systems to a crawl.

By Allen Bernard

CIO | Dec 6, 2012 7:00 AM PT

If you're paying attention to big data, lately you've probably heard terms such as in-memory analytics or in-memory technologies. Like many tech trends that appear new only because their histories are obscured by newer and sexier tech, or because time has yet to catch up with them—server virtualization and the cloud are just reinventions from the mainframe days, after all—in-memory is a term being resurrected by two trends today: big data and cheap, fast commodity storage, particularly DRAM.

"In-memory has been around a long, long time," says David Smith, vice president of marketing and community for Revolution Analytics, a commercial provider of software, services and support for R, the open source programming language underpinning much of the predictive analytics landscape. "Now that we have big data, it's only the availability of terabyte (TB) systems and massively parallel processing [that makes] in-memory more interesting."

If you haven't already, you'll start to see offerings such as SAP HANA and Oracle Exalytics, which aim to bring big data and analytics together on the same box. You can also get HANA as a platform supported in the cloud by Amazon Web Services, or through SAP's NetWeaver platform, which includes Java and some middleware.

Meanwhile, analytics providers such as SAS, Cognos, Pentaho, Tableau and Jaspersoft have all rolled out offerings to take advantage of the in-memory buzz, even if some of these offerings are mere bolt-ons to their existing product suites, says Gary Nakamura, general manager of in-memory database player Terracotta, a Software AG company.

"They're saying, 'Hey, we're putting 10 gigs of memory into our product capability because that's all it can handle, but we're calling it an in-memory solution,'" Nakamura says. The question, he adds, is whether they can scale to handle real-world problems and data flows. (To be fair, Terracotta has just released two competing products, BigMemory Max and BigMemory Go, the latter of which is free up to 32 GB. Both products scale into the TB range and can run on virtual machines or in distributed environments.)

In-Memory Technology Removes Latency From Analytics

"What it comes down to," says Shawn Blevins, executive vice president of sales and general manager at Opera Solutions, is that each product has "an actual layer where we can stage the data model itself, not just the data—and they exist in the same platform and the same box in flash memory."

From a business point of view, this is really what matters. In-memory technology gets complicated quickly. If you want to understand how all the bits and bytes line up, then it's probably best to call down to your IT guys for another rousing round of "What's that part do again?" However, if you want to understand why in-memory is becoming the buzzword du jour, that's a little easier: It provides business insights that lead to better business outcomes in real-time.

Essentially, in-memory analytics technology lets businesses take advantage of performance metrics gleaned from production systems and turn those into KPIs they can do something about. A company such as Terracotta can give away 32 GB of capacity because in-memory analytics doesn't require the entire fire hose of data that a traditional BI app needs in order to produce useful results.

"The deal with in-memory analytics is the analysis process is all about search," says Paul Barth, co-founder of data consulting firm NewVantage Partners. You're trying to see how many different combinations of things, such as blue car owners and ZIP code, are correlated, he adds.

For every one of those correlations, it takes time to pull the data, cluster it, find the dependencies and see how strongly one variable is affected by the others. Every time you pivot that table to find something new or get some clarity, data moves and gets reorganized. That introduces latency—which is precisely the problem in-memory analytics is designed to defeat.

"You can do a lot of those analyses in just very rapid iterations and say, 'Look at it this way, look at that way,'" Barth says. On the other hand, "if you're pulling off a disk, it could be a whole other query. Every time you have to do an iteration, if it takes a minute or two to pull that data out of memory [and] I want to do that thousands of times, then it's taking me a half an hour to an hour to run through an analysis vs.…less than a minute if I just flip this on its head."
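Barth's point can be sketched with a toy Python timing experiment (the dataset, column names and iteration count are all hypothetical illustrations, not anything from the article): re-parsing the raw data on every "pivot" stands in for disk-bound iteration, while keeping the parsed rows resident stands in for in-memory analysis.

```python
import csv
import io
import random
import time

random.seed(0)

# Toy dataset: (car_color, zip_code) pairs, written once to an in-memory
# CSV buffer standing in for a file on disk.
rows = [(random.choice(["blue", "red", "green"]),
         random.choice(["02134", "10001", "60601"]))
        for _ in range(50_000)]
buf = io.StringIO()
csv.writer(buf).writerows(rows)

def correlate(data):
    """One 'pivot': count co-occurrences of car color and ZIP code."""
    counts = {}
    for color, zip_code in data:
        key = (color, zip_code)
        counts[key] = counts.get(key, 0) + 1
    return counts

ITERATIONS = 20

# Disk-style: re-read and re-parse the raw text on every iteration.
t0 = time.perf_counter()
for _ in range(ITERATIONS):
    buf.seek(0)
    correlate(csv.reader(buf))
disk_style = time.perf_counter() - t0

# In-memory style: parse once, then iterate over the resident structure.
t0 = time.perf_counter()
for _ in range(ITERATIONS):
    correlate(rows)
in_memory = time.perf_counter() - t0

print(f"re-parse each pass: {disk_style:.3f}s  resident data: {in_memory:.3f}s")
```

The aggregation work is identical in both loops; only the repeated parse-and-load cost differs, which is exactly the latency that piles up when each pivot goes back to disk.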

High-Frequency, Low-Computation Analysis—For Now

At this stage of the game, big data analytics is really about discovery. Running iterations to see correlations between data points doesn't happen without milliseconds of latency, multiplied by millions (or billions) of iterations. Working in memory is three orders of magnitude faster than going to disk, Barth says. "Speed matters in this business."

Ever wonder how Facebook can tag you in a photo as soon as it goes live on the site? A photo is a big file, and Facebook has exabytes of photos on file. Facebook runs an algorithm against every photo to find faces and reduce those faces to a few data points, says Revolution's Smith. This reduces a 40 MB photo down to about 40 bytes of data. The data then goes into a "black box," which determines whose face it is, tags it, searches for that person's account and all the accounts associated with that person, and sends everyone a message.
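The size reduction Smith describes can be illustrated with a deliberately crude sketch (the `face_signature` function and its statistics are invented for illustration; real face recognition uses learned embeddings, not checksums):

```python
import struct
import zlib

def face_signature(pixels: bytes) -> bytes:
    # Toy stand-in for feature extraction: collapse a large pixel buffer
    # into a tiny fixed-size signature (byte length, CRC-32 checksum,
    # mean intensity). The point is only the scale of the reduction:
    # megabytes in, a handful of bytes out.
    n = len(pixels)
    mean = sum(pixels) // max(n, 1)
    return struct.pack(">IIB", n, zlib.crc32(pixels), mean)  # 9 bytes total

photo = bytes(40 * 1024 * 1024)   # stand-in for a 40 MB photo
sig = face_signature(photo)
print(f"{len(photo):,} bytes -> {len(sig)} bytes")
```

Once every photo has been boiled down to a signature this small, comparing a face against millions of candidates becomes a problem that fits comfortably in memory.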

That's big data at work. But it's also how in-memory analytics makes big data work. Currently, most people don't put more than 100 MB into an in-memory cache at any one time because of Java's limitations. The more data that's put into memory, Nakamura says, the more you have to tune the Java virtual machine. "It gets slower, not faster, and that is problematic when you are a performance-at-scale play." (Terracotta's BigMemory product line gets around this issue.)

For now, in-memory analytics is well-suited to high-frequency, low-computation number crunching. Of course, when you have terabytes of DRAM or flash storage available to run real-time analytics against, that behavior will change. In this case, the technology needs to catch up to the need, not the other way around. The need exists, the data exists and, based on the number of announcements coming from Hadoop World in October, the technology is on its way. No chicken and egg here.

Allen Bernard is a Boston native now living in Columbus, Ohio. He has covered IT management and the integration of technology into the enterprise since 2000. You can reach Bernard via email or follow him on Twitter @allen_bernard1.