How Chaos Engineering Can Bring Stability to Your Distributed Systems

To prepare for this article, I asked a bunch of my engineering and agile coaching friends and colleagues if they deploy the emerging practice of chaos engineering. I got a lot of the same response — “Of course, every day.” or “Haha that’s my life.” But they were more joking about how they try to herd cats or practice their own form of rogue coding.

Certainly, every IT manager grumbles when trouble plagues their operations. But streaming service Netflix actually courted such havoc, all in order to make its system stronger. The company’s engineers gave this practice a name, chaos engineering, which they defined as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

Or, if you are Alexei Ledenev, chief of research at Codefresh, chaos engineering is all about recognizing and managing the complexity of both systems and people. At this year’s Container Camp, he spoke about how, over the lifetime of a project, different trends in architecture, various integrations, and constantly updating external services and APIs combine with evolving teams to make your story very complex.

“Building software that unintentionally becomes complex is easy, but it is hard to maintain complex software,” Ledenev said.

Chaos engineering is, therefore, embracing the potential for failure and looking at it as an opportunity to become a more flexible, adaptable team with more flexible, adaptable architecture. And by accepting that you’ll have failure, you can control the failure and get to know your system — and by extension the team building it — better.

What Is Chaos Engineering?

Chaos engineering was introduced and named by Netflix architects as they were building out their infrastructure on Amazon Web Services. Their version of preparing for failure by embracing the chaos goes as follows:

Embrace the failure.

Control the failure.

See how your system behaves.

Then, in a very unchaotic, methodical fashion, you throw the kitchen sink at your code. Except it’s not the kitchen sink; rather, you specifically test your systems against the risks and flaws most pertinent to them.

Once you follow these three axioms, you can learn not only about your system but also about what makes it stumble and stall. Ledenev says this can involve randomly injecting pseudo faults, like terminating virtual machines, killing containers, and changing networks.

“Try to discover weaknesses or deviation from the norm,” he said.

In chaos engineering, as you try to achieve stability at scale, you experiment following these four steps:

Define that ideal state of the system’s normal behavior.

Create a control group and an experimental group.

Introduce real-world wrenches, like changing servers.

Try to find the difference or weakness between the control and what is crashing.
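The four steps above can be sketched in a few lines of Python. This is only a minimal illustration against a simulated service, not how Netflix or any particular tool implements it: the `simulate_request` function, the 100 ms baseline latency, the 20 percent fault rate, and the 20 ms tolerance are all made-up assumptions. A real experiment would read these metrics from production traffic.

```python
import random
import statistics


def simulate_request(fault_injected, fault_rate, rng):
    """Return a simulated latency in ms (hypothetical stand-in for a real service)."""
    latency_ms = rng.gauss(100, 10)  # assumed "normal" behavior
    if fault_injected and rng.random() < fault_rate:
        latency_ms += 400  # assumed effect of the injected fault
    return latency_ms


def run_experiment(requests=1000, fault_rate=0.2, tolerance_ms=20.0, seed=42):
    rng = random.Random(seed)

    # Step 1: the steady state is defined here as mean latency.
    # Step 2: a control group and an experimental group.
    control = [simulate_request(False, fault_rate, rng) for _ in range(requests)]
    # Step 3: the experimental group gets the real-world wrench.
    experimental = [simulate_request(True, fault_rate, rng) for _ in range(requests)]

    # Step 4: look for a difference between the two groups.
    control_mean = statistics.mean(control)
    experimental_mean = statistics.mean(experimental)
    deviation = experimental_mean - control_mean
    return {
        "control_mean": control_mean,
        "experimental_mean": experimental_mean,
        "deviation": deviation,
        "weakness_found": abs(deviation) > tolerance_ms,
    }


if __name__ == "__main__":
    print(run_experiment())
```

If `weakness_found` comes back true, the experiment has pointed at something to fix. If it stays false as the faults get harsher, confidence in the system grows.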

“The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large,” the Principles of Chaos Engineering website states.

Just remember, every variable you throw at your code should reflect real-world events. These will usually be things that have already happened to your code, but they can sometimes come from experiences you’ve heard about from other, similar distributed systems.

Then you prioritize those events by the frequency or the magnitude of the failure, or both. And don’t just consider faults in your code, but also hardware failures, servers dying, and security issues.

One final rule of chaos engineering is that you have to run these experiments in production to truly know how your systems will react, since that depends on your environment and traffic patterns, something you can’t mimic as well in a staging area.
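One simple way to do that prioritization is to score each event as frequency times magnitude and sort by the score. The sketch below assumes that scoring rule, which is just one reasonable choice, and the event names, yearly counts, and 1-to-5 impact ratings are invented for illustration. You would replace them with your own incident history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    name: str
    incidents_per_year: float  # frequency (made-up numbers below)
    impact: int                # magnitude on an assumed 1-5 scale

    @property
    def priority(self) -> float:
        # Assumed scoring rule: frequency multiplied by magnitude.
        return self.incidents_per_year * self.impact


def prioritize(events):
    """Return events ordered from highest to lowest priority score."""
    return sorted(events, key=lambda event: event.priority, reverse=True)


# Hypothetical events covering code, hardware, and security failures.
EVENTS = [
    FailureEvent("security incident", incidents_per_year=1, impact=5),
    FailureEvent("server dies", incidents_per_year=6, impact=4),
    FailureEvent("container killed", incidents_per_year=50, impact=2),
    FailureEvent("network partition", incidents_per_year=2, impact=5),
]

if __name__ == "__main__":
    for event in prioritize(EVENTS):
        print(f"{event.name}: {event.priority}")
```

The top of the sorted list is where the first chaos experiments should go.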

Tools and Animals to Help You Create Controlled Chaos Testing

Then it’s about automating the wrenches you are throwing at your technology. Sure, you can set up your chaos testing with manual tests, but for it to be successful you need automated testing. That way you can not only continually test for the known cases, but also spare time to identify new cases that may not have as big an impact but can still help improve the performance and stability of your systems.

Of course, there are tools that help you throw risks and flaws at your systems. And they all seem to be named after animals. First, Netflix created Chaos Monkey, a resiliency tool that pseudo-randomly slaughters its AWS virtual machines. This has evolved into a whole Simian Army, a suite of open-source tools to, well, throw poop at your applications and network systems.

Ledenev was searching for a “Chaos Monkey for Docker,” but couldn’t find anything specific to Docker or container clusters. So he created one. Pumba is a chaos testing warthog (remember Lion King?) for Docker. He found that, as a best practice for chaos testing, it should be possible to define repeatable time intervals and duration parameters to better control the chaos. With this in mind, he built Pumba so it can be deployed on a single Docker host, a Swarm cluster, or a Kubernetes cluster. It can do things like:

Stop running Docker containers.

Kill containers by sending a termination signal.

Remove containers.

Stop a random container once every ten minutes.

Kill a MySQL container every 15 minutes.

Kill random containers every 5 minutes.

Pause the queue for 15 seconds every 3 minutes.

Now, it may seem random to give so many examples from one small open source testing tool, but this gets to the crux of chaos testing. It’s about knowing your product well enough to know what most often makes it fail. These are ways that containers often fail.
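To make the interval-based examples above concrete, here is a rough dry-run sketch of the scheduling idea: choose targets by name pattern or at random, and repeat an action at a fixed interval. It is not Pumba’s code or its command-line interface, and it never touches Docker. The function names `pick_targets` and `chaos_schedule` and the container names are hypothetical, and the output is only a plan of what would be disrupted and when.

```python
import random
import re


def pick_targets(containers, pattern=None, pick_random=False, rng=random):
    """Select the container names an action would apply to."""
    candidates = [
        name for name in containers
        if pattern is None or re.search(pattern, name)
    ]
    if pick_random and candidates:
        return [rng.choice(candidates)]  # a single random victim per round
    return candidates


def chaos_schedule(containers, action, interval_s, total_s,
                   pattern=None, pick_random=False, rng=random):
    """Return (time_in_seconds, action, target) tuples; nothing is executed."""
    events = []
    for tick in range(interval_s, total_s + 1, interval_s):
        for target in pick_targets(containers, pattern, pick_random, rng):
            events.append((tick, action, target))
    return events


if __name__ == "__main__":
    # Hypothetical container names.
    names = ["web-1", "web-2", "mysql-1", "queue-1"]
    # "Kill a MySQL container every 15 minutes", planned over one hour.
    for event in chaos_schedule(names, "kill", 15 * 60, 60 * 60, pattern=r"^mysql"):
        print(event)
    # "Stop a random container once every ten minutes", planned over one hour.
    for event in chaos_schedule(names, "stop", 10 * 60, 60 * 60, pick_random=True):
        print(event)
```

Keeping the interval and the target selection as explicit parameters is what makes the chaos repeatable and controlled rather than merely random.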