DataStax Blog

When failure is not an option for your big data system

Having tackled the fundamentals of the peer-to-peer architecture in my last post, I now want to look at the part of a paragraph from our recent press release that touches on one of the most important aspects of running a mission-critical big data system. The paragraph reads:

Customers this year chose Cassandra time and time again over competing solutions. The peer-to-peer design allows for high performance with linear scalability and no single points of failure, even across multiple data centers. Combine this with native optimization for the cloud and an extremely robust data model and Cassandra clearly stands apart from the competition for enterprise, mission-critical systems. [emphasis added]

While there are definitely some difficult concepts in the world of big data, “no single point of failure” isn’t one of them. Someone more business-oriented may well prefer the phrase “continuous availability”, which is the result of having a system with no single points of failure. In either case, it means your system will remain available even under extreme circumstances because it is designed for failure.

You may have read that last sentence and said, “Typo! You surely didn’t mean your system is designed for failure!”

But that’s precisely what I mean. Notice that I didn’t say “your system is designed to fail.” I said your “system is designed for failure,” meaning that architecturally it is built on the assumption that the components that make up the system will individually fail (maybe even very frequently) but that the larger system as a whole will remain available. As Google said many years ago in this paper on its distributed file system: “First, component failures are the norm rather than the exception”.

This idea of accounting for component failures is often equated with the idea of “scaling out” or “scaling horizontally”, which is the opposite of “scaling up” or “scaling vertically”. When you scale horizontally, you add more machines to your system to increase capacity. When you scale vertically, you add more capacity to a single machine. For years, relational databases have handled increased capacity by scaling vertically. Aside from the other challenges that approach causes, it leaves a single point of failure that jeopardizes continuous availability.

But here’s the thing that many people don’t realize: scaling horizontally does not, by itself, eliminate the challenge of a single point of failure. To truly achieve continuous availability, you have to understand the system architecture, which goes back to the discussion in my last post around the differences between “master/slave” and “distributed peer-to-peer.” Adding read slaves in a master/slave architecture increases read capacity, but the master is still a single point of failure, and there’s no easy way of getting around that.

Conversely, Cassandra’s fully distributed architecture means every node is the same. Every node is a master, and every node is a slave. You never have to worry about how and when to add nodes of a certain type to increase capacity. And that also means you never have to worry about losing a node in your system. Failover testing isn’t required in Cassandra because Cassandra is constantly failing over from the moment you start your first cluster.
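To make the contrast concrete, here is a minimal toy sketch in Python. It is not Cassandra's implementation: the class names, the ring-style replica placement, and the rule that a write succeeds if at least one replica is up (loosely analogous to a consistency level of ONE) are all simplifying assumptions chosen for illustration.

```python
import zlib


class MasterSlaveCluster:
    """Toy model: a single master takes writes; read slaves only serve reads."""

    def __init__(self, slave_count):
        self.up = {"master": True}
        for i in range(slave_count):
            self.up[f"slave-{i}"] = True

    def fail(self, node):
        self.up[node] = False

    def can_write(self):
        # Every write depends on the one master being alive.
        return self.up["master"]

    def can_read(self):
        # Reads can be served by any node that is still up.
        return any(self.up.values())


class PeerCluster:
    """Toy model: identical nodes on a ring, each key stored on `rf` of them."""

    def __init__(self, node_count, replication_factor):
        self.nodes = [f"node-{i}" for i in range(node_count)]
        self.up = {name: True for name in self.nodes}
        self.rf = replication_factor

    def fail(self, node):
        self.up[node] = False

    def replicas_for(self, key):
        # Hypothetical placement: hash the key to a ring position, then take
        # the next `rf` nodes clockwise.
        start = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.rf)]

    def can_write(self, key):
        # Assumption: one live replica is enough to accept the write.
        return any(self.up[node] for node in self.replicas_for(key))


if __name__ == "__main__":
    ms = MasterSlaveCluster(slave_count=3)
    ms.fail("master")
    print("master/slave writes available:", ms.can_write())

    peers = PeerCluster(node_count=4, replication_factor=3)
    peers.fail("node-0")
    print("peer-to-peer writes available:", peers.can_write("user:42"))
```

In this toy model, losing the master stops all writes no matter how many read slaves exist, while the peer cluster keeps accepting writes for every key after losing any single node, because each key still has live replicas.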

Or said more succinctly, Cassandra is truly built for continuous availability when failure is not an option.