The Ripple Effect

The Trouble with Hadoop

Whenever IT folks talk about handling their big data problems by scaling out with Hadoop, I tend to think about the 1986 comedy “Big Trouble in Little China.” It chronicles the mishaps that ensue when a trucker gets dragged into a mystical battle in Chinatown. It’s kind of awful, but with John Carpenter in the director’s chair and Kurt Russell on the screen, it still delivers some laughs.

Hadoop is similarly awful. Except with Hadoop, nobody’s laughing.

The reasons for this have recently been enumerated in multiple write-ups arguing that Hadoop is simply not a very good scale-out solution for all but the most limited applications.

Let’s run a roll call on some of these observations about Hadoop, its shortcomings, and the reasons it appears to have stagnated.

It’s failing to deliver the future it promised. Hadoop may be a victim of the hype assigned to it by the media. Because it was largely associated with big data, and because big data caught the attention of many enterprises that were working with data that wasn’t nearly as big as they expected it would be, Hadoop begat many implementations that failed to deliver any measurable value. This hasn’t done much for Hadoop’s reputation in data centers worldwide. As Kashif Saiyed noted on KDnuggets: “Hundreds of enterprises were hugely disappointed by their useless 2 to 10TB Hadoop clusters – Hadoop technology just doesn’t shine at this scale.”

It’s limiting. Hadoop allows you to run only a single general computation and offers you precious little flexibility to accomplish even that. Too much of it is fixed, which is a problem for organizations needing a scale-out solution that conforms to their needs, rather than the other way around.

It ain’t easy. As big data consultant Richard Jackson writes, an enterprise-class Hadoop implementation will likely require significant Hadoop expertise, and many organizations have found themselves ill-equipped internally to address that need and unwilling to invest the necessary sums to outsource it. When polled by Gartner in 2015, 54 percent of respondents had no plans to adopt Hadoop, and 57 percent of those said the skills gap was the primary reason. Indeed, as blogger Kim Rose noted, system administrators can be lulled into thinking they’ll be able to manage a Hadoop production cluster after running a small problem on a small test cluster. “This is rarely the case,” Rose wrote in a post listing the “common and time-consuming” problems Hadoop administrators face, from the complex and largely manual task of scaling to “Hadoop’s often arcane errors and failure modes.” Then there’s the need to worry about network design. On this point, Rose cites a GigaOm study reporting that most corporate networks are not designed to accommodate Hadoop’s need for high-volume data transfers between nodes.

It’s a heavy lift for occasional needs or specific applications. Organizations that aren’t Amazon, Facebook or Google (which is to say, virtually all of them) may need a way to scale their computing capability beyond their largest server, but it can be tough to justify the cost of a Hadoop roll-out just for that. The 2015 Gartner survey found that nearly half (49 percent) of enterprises with no plans to implement Hadoop simply couldn’t see the value in it. Ouch.

A better way

Those are the highlights. The truth is, until now, if your model, data set or problem was bigger than your biggest server, you had three options: scale up by buying a bigger server, scale out by distributing your problem across clusters (via Hadoop and others), or limit the size of your problem to match the size of your server. The first option costs money; the second costs time and money; the third costs time. None of these options is desirable when insights are important and delaying them threatens productivity.

The operative phrase above is “until now.” Because now you can easily, quickly, and flexibly scale beyond the limits of a single server with Software-Defined Servers.

Software-Defined Servers allow you to scale your computer to match the size of your problem, completely on demand, by combining multiple commodity servers into one or more virtual servers. You gain access to all the resources associated with those servers – cores, memory, I/O, etc. – and your software sees them as a cohesive whole and not a distributed array of disparate but networked components. This is a far less burdensome alternative to Hadoop or any other traditional scale-out scheme that requires extensive recoding or sharding of large data sets, and a much more desirable approach than shrinking the size of your problem to fit your available computer.
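To make the "recoding or sharding" burden concrete, here is a minimal Python sketch. It is not Hadoop code and not anything from TidalScale; the function names (`count_words`, `shard`, `count_words_sharded`) and the toy input are assumptions made up for illustration. The first function is what you write when the whole data set sits in one address space. The rest is the extra split, map, and merge logic that a scale-out approach asks you to add, even for a problem this trivial.

```python
from collections import Counter
from functools import reduce


def count_words(lines):
    """Single-memory version: one pass over data held in one address space."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def shard(lines, n_shards):
    """Hypothetical helper: split the input into n_shards interleaved pieces."""
    return [lines[i::n_shards] for i in range(n_shards)]


def map_shard(lines):
    """Map step: each worker counts words in its own shard only."""
    return count_words(lines)


def merge_counts(left, right):
    """Reduce step: combine two partial word counts."""
    return left + right


def count_words_sharded(lines, n_shards=3):
    """Scale-out version: split, count each shard, then merge the partials.

    The shards are processed in a plain loop here; a real cluster would
    ship each shard to a different node.
    """
    partials = [map_shard(piece) for piece in shard(lines, n_shards)]
    return reduce(merge_counts, partials, Counter())


lines = ["big data big trouble", "little china big laughs", "data data"]
single = count_words(lines)
sharded = count_words_sharded(lines)
print(single == sharded)
```

Both paths produce the same counts, but only the second one forces you to decide how to split the data and how to recombine partial results. Word counting merges cleanly; many real models and queries do not, which is where the recoding effort tends to pile up.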

At TidalScale, we enable organizations struggling to handle large problems to implement Software-Defined Servers of virtually any size via our HyperKernel software. The HyperKernel works between your unmodified operating system and your bare metal commodity servers. (And when we say “unmodified,” we mean it. We’re talking zero changes to your OS, your applications or your model.)

The TidalScale HyperKernel also delivers enormous flexibility to IT administrators who must maintain and manage dozens, even hundreds of servers that are underutilized for much of their service time. And it dynamically allocates resources within the aggregated Software-Defined Server so workloads are concentrated near the resources they need the most, thus reducing the need for extraordinarily fast interconnects. In fact, our implementation of Software-Defined Servers performs exceptionally well across standard 10Gb Ethernet networks.

For more details on how this all works – and how Software-Defined Servers represent a much less troublesome option than Hadoop for scaling beyond a single server – I encourage you to check out our recent webinar with Data Science Central.

Good post. I think the one area I would add in terms of constraints is trying to build an SLA equivalent to traditional systems. The standard tooling with any Hadoop distribution is based on labour-intensive MapReduce scripting + Hadoop cluster-intensive batch copying, which across multiple data centres is expensive in terms of people time & idle assets and results in RPO & RTO nowhere near zero minutes. Technologies like WANdisco can deliver selective active-active replication in near real time across multiple data centres. For full disclosure I work with WANdisco.
