Deflating the Hype Over In-Memory Databases

Moore’s law has repeatedly doubled transistor density in both RAM and processors since the beginning of industrial computing. RAM doubling continues unabated, providing previously unheard-of supplies of main memory. However, we have reached fundamental limits on building bigger and faster single von Neumann processors, and are now in the cloud age of radically multi-core, parallel-distributed “scale-out” compute platforms.

On top of these changes in hardware, radical new ways to virtualize hardware create new opportunities for a radically more elastic and software-defined data center. Virtualization and scale-out power new ways of thinking about system stability, including a shift away from “reliability,” where giant expensive systems never fail (until they do, catastrophically), and towards “resiliency,” where thousands of inexpensive systems constantly fail—but in ways that don’t materially impact running applications.

Along with new forms of memory such as hybrid flash and the fundamental shift to parallel distributed processing, a completely new data architecture is being born. NoSQL database vendors such as MongoDB (which has raised $223 million in venture capital) are attacking the strongholds of Oracle and IBM, which were born in the scale-up “Big Box” age of the mainframe computer, while big data vendors such as Cloudera (which has raised $141 million in venture capital) are radically distributing the data warehouse through commodity scale-out hardware, open source and distributed computing.

In addition, new applications such as consumer Internet, mobile, cloud applications, public APIs, the Internet of Things and “Big Data” are forcing software architects to fundamentally re-evaluate elastic scalability, in terms of both “burst scale” and new heights of peak load and unpredictability.

Out of this environment, both Oracle and SAP have announced huge, hardware-driven in-memory database projects. This article will focus on how expensive and proprietary “Big Box” solutions currently flying under the banner of “in-memory databases” are an architecturally wrong approach to today’s requirements for elastic scalability for modern applications. New distributed ways of using the vast pools of RAM provided by Moore’s law will be needed, but the “giant database machine” architecture is a relic of the past.

OK, so before I dissect the rubbish claims of Oracle 12c and SAP HANA, let’s agree on a few things:

Yes, Memory is Fast

Memory is roughly one million times (1,000,000x) faster than disk. RAM access time is measured in nanoseconds; disk time is measured in milliseconds, as is network time.

What’s important about this is that, as soon as you touch the network or the disk, you lose most of the advantage of RAM speed. This is why when you run an application on your laptop, the code is “in-memory,” the data is “in-memory,” and both are on the same machine.
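The arithmetic behind this point can be sketched in a few lines. The latency figures below are round-number assumptions chosen for illustration (10 ns per RAM access, 10 ms per disk access, 1 ms per network round trip), not measurements, and the request of 1,000 lookups is hypothetical.

```python
# Illustrative latencies in nanoseconds. These are assumed round numbers,
# not measurements of any particular hardware.
RAM_NS = 10               # one RAM access
DISK_NS = 10_000_000      # one disk access (10 ms)
NETWORK_NS = 1_000_000    # one network round trip (1 ms)

LOOKUPS = 1_000           # hypothetical number of lookups per request

# Ratio of disk to RAM access time under the assumptions above.
disk_vs_ram = DISK_NS // RAM_NS

# Data in the same process as the code: every lookup costs one RAM access.
local_ns = LOOKUPS * RAM_NS

# Data held in RAM on another machine: every lookup also pays a round trip.
remote_ns = LOOKUPS * (NETWORK_NS + RAM_NS)

print(f"disk is {disk_vs_ram:,}x slower than RAM")
print(f"local request:  {local_ns / 1_000_000:.2f} ms")
print(f"remote request: {remote_ns / 1_000_000:.2f} ms")
```

Under these assumptions the remote request is dominated almost entirely by network time, even though the remote data is also held in RAM.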

Yes, In-Memory Computing is Growing Exponentially

Moore’s Law creates an exponential supply, doubling the transistor density over a fixed period of time, applying to both RAM and processing.

Welcome to the Hardware Jungle

Moore’s Law details the reliable doubling of transistor density at the same price. This has resulted in a proportional expansion of processor speed as well as memory size.

But Herb Sutter’s seminal blog post “Welcome to the Jungle” makes it clear that we are reaching fundamental limits on the scaling of processor speed, and that the days when mainstream processors were single-processor von Neumann machines are over. Sutter paints a picture of a world where we are forced to either “scale in” with many cores per processor, or “scale out” with many processors.

In this endgame, the idea of the code and the data living together on the same machine, like in the Mainframe days, starts to make less and less sense: it’s just going to get more and more radically distributed. Virtualization and elastic demand curves are enabling new parallel-distributed architectures to emerge, which are challenging the traditional database and traditional data warehouse paradigm.

Welcome to the Software Jungle

If, in an ideal scenario, code and the data are living together on the same machine, why isn’t enterprise software built like that? Here are some approaches and challenges.

One Giant Machine: Mashing data and code onto one big mainframe is expensive and proprietary.

Put Code in the Database: Stored procedures inside databases have traditionally resulted in vendor lock-in and unmaintainable spaghetti code.

Put the Data in the Application Tier: This is hard because of concurrency. Two application servers can try to write the same local record: which one wins?
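One common answer to the “which one wins?” question is optimistic concurrency: each record carries a version number, and a write succeeds only if the version is unchanged since the writer last read it. The sketch below is a single-process illustration of that idea, not the API of any particular product; the `VersionedStore` class, its method names and the `cart:42` key are all hypothetical.

```python
import threading


class VersionedStore:
    """Hypothetical in-memory store that rejects writes based on stale reads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Returns (version, value); a missing key reads as version 0.
        with self._lock:
            return self._data.get(key, (0, None))

    def write_if_unchanged(self, key, expected_version, value):
        # Compare-and-set: apply the write only if nobody else wrote first.
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False
            self._data[key] = (current_version + 1, value)
            return True


store = VersionedStore()
store.write_if_unchanged("cart:42", 0, {"items": 1})

# Two application servers read the same record at the same version...
version_a, _ = store.read("cart:42")
version_b, _ = store.read("cart:42")

# ...and both try to write. The first write wins; the second is rejected
# and would have to re-read and retry.
a_won = store.write_if_unchanged("cart:42", version_a, {"items": 2})
b_won = store.write_if_unchanged("cart:42", version_b, {"items": 3})
print(a_won, b_won, store.read("cart:42"))
```

In a real distributed setting the version check has to happen wherever the record is owned, which is part of why this approach is hard.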

What makes life even worse is “Service Oriented Architecture.” Let’s say you’re a retailer with an eCommerce Website, a mobile app and retail stores. Let’s say that you have dozens of different applications accessing the same database. How do you provision for “Black Friday,” when all of these applications will suddenly consume a lot more resources? We don’t know which applications are going to grow, nor do we know by how much. So trying to provision database resources for this specific timeframe is a nightmare. It’s also inefficient to buy a huge Exadata box with 32TB of RAM just for Black Friday, only to leave that box 99 percent idle for the rest of the year.

Trying to just scale up the database tier by using lots of RAM is the wrong way to think about the problem. As with all architectural perspectives, the correct answer is “it depends,” but the impulse to put the data in the application tier and handle the complexity of concurrency is a positive impulse for cloud, SaaS and Service Oriented applications, if not for most modern application architectures.

The Benchmark That Matters

Your app is the benchmark that matters.

First of all, databases already hold a lot of data “in-memory.” The problem is that it’s often on a different machine. Yes, you can buy an Exadata box from Larry Ellison with 32 terabytes of RAM for “only” $3 million. Yes, we can produce synthetic database benchmarks that show a 100x improvement in database performance. SAP is moving lots of hardware with the HANA “appliance” and moving big boxes “full” of HANA. But what’s really happening?

What’s happening is that the database is “faster” but the application isn’t, because the application still has to reach over the network to another machine.

You can build your architecture “two-tier” and put the database server on the same machine as the application server. This approach will work in some cases, especially if the machine can simulate scale-out through generating the many virtual nodes that represent database and application server machines, all connected through the motherboard. But again, these tend to be very expensive mainframe-style-computing examples, rather than commodity hardware.

Distributed Computing has Arrived

The reality is that database admins and IT operations people see pain in database performance. Sure, it seems like a good idea to buy bigger boxes to make this pain go away. But we are fundamentally moving into a world of distributed data and distributed processing. Hardware and software architectures need to evolve beyond Service-Oriented Architecture (SOA) and into a more fully distributed approach. This has already been achieved in the data warehouse with the Hadoop ecosystem, a way to replace proprietary star-schema databases with an open distributed data and processing engine for the enterprise, running on commodity scale-out hardware.

The real solution for application scalability may indeed be in-memory computing, but the answer will come from an open systems (read: “non-mainframe”) approach for distributing code and data that brings the data and code together on the same machine. One positive for Oracle in this race is the Oracle Coherence product, which is able to bring database data up into the application tier and in-memory. But Oracle would rather sell you a very large, expensive Exadata box.
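As a rough illustration of what “bringing the data and code together on the same machine” can mean in a scale-out design, the sketch below hashes each key to a partition, assigns each partition to a node, and sends the function to the node that owns the key rather than pulling the data across the network. It is a toy single-process model under assumed names (`Node`, `Grid`, `execute_on_owner`, 16 partitions), not the Hazelcast or Coherence API.

```python
import zlib


class Node:
    """Hypothetical grid member holding its share of the data in memory."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def execute(self, key, func):
        # The function runs next to the data and the result is stored locally.
        self.data[key] = func(self.data.get(key))
        return self.data[key]


class Grid:
    """Hypothetical grid that routes each key to exactly one owning node."""

    def __init__(self, node_names, partitions=16):
        self.nodes = [Node(name) for name in node_names]
        self.partitions = partitions

    def owner(self, key):
        # A stable hash maps the key to a partition, and the partition to a node.
        partition = zlib.crc32(key.encode("utf-8")) % self.partitions
        return self.nodes[partition % len(self.nodes)]

    def execute_on_owner(self, key, func):
        return self.owner(key).execute(key, func)


def add_one(current):
    return (current or 0) + 1


grid = Grid(["node-a", "node-b", "node-c"])
grid.execute_on_owner("stock:sku-1", add_one)
total = grid.execute_on_owner("stock:sku-1", add_one)
print(grid.owner("stock:sku-1").name, total)
```

Because every key has a single owner, two callers updating the same record are serialized at that owner, and adding nodes spreads both the data and the work.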

Right now, with both SAP and Oracle embracing the concept of “In-Memory Database,” the hype level has reached epic proportions. To be honest, there has always been big iron to throw at big data—that’s just a matter of replacing proper IT systems architecture with expensive hardware and software. But it’s time to cool off the fake benchmarks and rhetoric and think clearly about the most cost-effective approaches to application performance and scalability. The new architecture powered by scale-out distributed commodity hardware has arrived.

Miko Matsumura is a Vice President at Hazelcast, an open source in-memory data grid company. He is a 20-year veteran of Silicon Valley.

Image: Konstantin Yolshin/Shutterstock.com

Comments

Yes, distributed sounds great, until you realize that it implies a network and its latency. That is where a big box like the Oracle M6-32 has a big advantage. Coherence, Hazelcast and HANA are all distributed architectures, and all involve dealing with the problems of single-node memory limitations, data partitioning, network latency, JVM memory limitations, Java performance, etc. Yes, you will pay a bunch for an Oracle mainframe, but you will have a single memory address space and truly ‘ungodly speeds’, all transparent to your applications, old and new.

What they meant by faster is data retrieval times, especially with regard to internet users who are connected to database-powered websites. As we all know, DSL speeds aren’t necessarily keeping pace with Moore’s law, and a slow database server based on older architectures only compounds the problem.