Why Testers Need Chaos Engineering

What is Chaos Engineering?

If this is the first time you’re hearing about Chaos Engineering, you might be wondering what it even is.

It was explained very well by Tammy Butow during a recent TestTalks interview.

She suggested we think of Chaos Engineering as being like preventative medicine in that it’s a disciplined approach to identifying failures before they become outages.

Controlled Chaos

The best way to keep chaos under control is to proactively test how your system responds under stress. You can then identify and fix failures before they impact your customers or damage your reputation through the poor publicity an outage can attract. (Remember all the bad press healthcare.gov received when it first launched?)

The idea behind Chaos Engineering is to compare what you believe will happen in your distributed system with what actually happens.

To learn how to build resilient software systems, you can use a chaos testing tool to break things in your environment on purpose and see whether the system actually fails the way you believed it would.

Here’s the thing.

You can always sit in a room and draw diagrams on a whiteboard and form a hypothesis about how things “might” break, or fail, but you never really know until there’s an actual failure.

That’s the big idea here.

It’s all about carefully thought-out experiments, rather than randomly inserting chaos or injecting failure into your data center. Chaos testing is not about simply walking into your office on Monday morning and saying that you’re going to bring down production. That isn’t how it works.

Chaos Engineering is achieved by using thoughtful, planned-out experiments that help to reveal weaknesses in your systems.

In keeping with the preventative medicine analogy, Chaos Engineering works like a vaccine: it injects a little bit of harm, but it’s for the overall good.
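To make the idea of a planned experiment a bit more concrete, here is a minimal Python sketch of the loop described above: state a hypothesis, inject one failure on purpose, observe what happens, and roll back. Everything in it is hypothetical and made up for illustration (the `RecommendationService` dependency, the `HomePage` system under test, and the `run_experiment` helper); it is a toy model, not how Chaos Monkey or any other chaos tool is implemented.

```python
from __future__ import annotations

from dataclasses import dataclass


class DependencyDown(Exception):
    """Raised by the toy dependency while a fault is injected."""


@dataclass
class RecommendationService:
    """Hypothetical dependency that the experiment breaks on purpose."""
    healthy: bool = True

    def fetch(self, user_id: int) -> list[str]:
        if not self.healthy:
            raise DependencyDown("recommendations unavailable")
        return [f"item-{user_id}-a", f"item-{user_id}-b"]


class HomePage:
    """Toy system under test: it should degrade gracefully, not crash."""

    def __init__(self, recommendations: RecommendationService) -> None:
        self.recommendations = recommendations

    def render(self, user_id: int) -> dict:
        try:
            items = self.recommendations.fetch(user_id)
            return {"status": 200, "items": items, "degraded": False}
        except DependencyDown:
            # Fallback path: serve a generic list instead of failing the page.
            return {"status": 200, "items": ["popular-1"], "degraded": True}


def run_experiment(page: HomePage, dependency: RecommendationService) -> dict:
    """Inject one fault, observe the result, then always roll back."""
    hypothesis = "home page still returns 200 when recommendations are down"
    baseline = page.render(user_id=1)   # what "normal" looks like
    dependency.healthy = False          # inject the failure
    try:
        observed = page.render(user_id=1)
    finally:
        dependency.healthy = True       # roll back even if render() raised
    return {
        "hypothesis": hypothesis,
        "baseline_status": baseline["status"],
        "observed_status": observed["status"],
        "degraded": observed["degraded"],
        "hypothesis_held": observed["status"] == 200,
    }


if __name__ == "__main__":
    dep = RecommendationService()
    print(run_experiment(HomePage(dep), dep))
```

The part worth copying is the shape rather than the details: the hypothesis is written down before anything is broken, the fault is small and targeted, and the rollback sits in a `finally` block so the system is restored even when the experiment goes badly.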

Where did Chaos Engineering Start?

If you’re a tester, the term Chaos Monkey might sound familiar to you. It was heavily talked about back in 2011 when the Netflix team created it as a way to test server failures during their move from bare metal to Amazon Web Services (AWS) in the Cloud. What’s really cool is that Netflix open sourced it, which means you can download it for free from GitHub.

That’s the good news.

The bad news is that it’s not the most user-friendly tool out there…especially for newbies. But no worries – I’ll let you know about a more accessible option later in this post.

Before we get to that, however, I want to explain why I think there is going to be more demand for Chaos Engineering than ever before.

Why do you Need Chaos Engineering?

As more and more companies move from a monolithic architecture to a microservice architecture, many of the engineering teams I’ve spoken with aren’t even 100% certain what each service does or how one impacts another.

In some of the more extreme cases, especially with more complex systems, they don’t even necessarily know what microservice dependencies they have in production.
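If you do have even a rough map of which service calls which, you can already reason about how one service impacts another. The Python sketch below walks such a map to find every service that would feel a given failure. The service names and the `CALLS` map are invented for illustration, and a real map would come from your own tracing or service discovery data.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
CALLS = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def blast_radius(calls, failed):
    """Return every service that directly or indirectly calls `failed`."""
    # Invert the map so we can walk from a service to its callers.
    callers = {name: set() for name in calls}
    for service, deps in calls.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


if __name__ == "__main__":
    print(sorted(blast_radius(CALLS, "inventory")))  # ['checkout', 'search', 'web']
    print(sorted(blast_radius(CALLS, "payments")))   # ['checkout', 'web']
```

A list like this is only a prediction. Whether "web" really breaks when "inventory" is down, or falls back gracefully, is exactly the kind of question a chaos experiment answers.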

As a tester you’ve probably noticed that it’s getting harder to keep up with the pace at which our companies are trying to develop and introduce software solutions to meet the demands of our customers.

It gets even worse when you begin to realize how expensive system downtime is for our companies. Not only are we wasting considerable funds on interruptions, but we’re most likely losing customer loyalty as well.

On the more human side of things, you’ve probably found that constantly having to support these outages is causing some engineers to burn out pretty quickly.

Chaos Engineering is something that can help with a lot of these scenarios.

Are you convinced yet?

If so, you might be wondering how to get started.

Prerequisites Before Starting Chaos Engineering

Before you do Chaos Engineering, there are some prerequisites you’ll need to have in place:

• Monitoring/observability

• On-call and Incident Management

• Cost of downtime per hour

The most important prerequisite is monitoring and observability, so that you know how your system is currently doing.

Without it, you won’t be able to measure how your system behaves while you’re performing chaos experiments.

You should also already have an On-Call and Incident Management program in place, and know your cost of downtime per hour.

Knowing the cost of downtime is crucial when you talk about the value that Chaos Engineering brings to your organization and want to get management buy-in.

When you have these things, you’re also better able to know which services are the most critical in your infrastructure.
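The cost-of-downtime conversation with management is mostly simple arithmetic. Here is a small Python sketch of it; the function names are my own and every number is a made-up placeholder, so swap in your own figures.

```python
def downtime_cost(cost_per_hour, outage_minutes):
    """Cost of a single outage, given the hourly cost of downtime."""
    return cost_per_hour * outage_minutes / 60


def annual_savings(cost_per_hour, outages_per_year, minutes_before, minutes_after):
    """Yearly savings if each outage gets shorter after resilience work."""
    before = outages_per_year * downtime_cost(cost_per_hour, minutes_before)
    after = outages_per_year * downtime_cost(cost_per_hour, minutes_after)
    return before - after


if __name__ == "__main__":
    # All figures below are invented placeholders, not data from any company.
    print(downtime_cost(cost_per_hour=12_000, outage_minutes=45))  # 9000.0
    print(annual_savings(12_000, outages_per_year=6,
                         minutes_before=45, minutes_after=15))     # 36000.0
```

With those placeholder numbers, a 45-minute outage costs 9,000, and cutting six such outages a year down to 15 minutes each would save 36,000. That is the kind of figure that makes the value of the work easy to explain.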

The CEO of Honeycomb recently said that “Chaos Engineering without observability… is just chaos.”

You want to know how your system behaves before you introduce any chaos, and you want to know how it handles each chaos experiment as it runs.
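As a rough illustration of what "knowing how your system is doing" can mean in numbers, the Python sketch below summarizes a baseline and an experiment run with two common metrics, 95th percentile latency and error rate, and checks them against limits. The sample data, the helper names, and the thresholds are all assumptions for the example; in practice the samples would come from whatever monitoring tool you already use.

```python
import math


def error_rate(status_codes):
    """Fraction of responses that were server errors (HTTP 5xx)."""
    if not status_codes:
        raise ValueError("need at least one sample")
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


def p95(latencies_ms):
    """95th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_ms, status_codes):
    """Collapse raw samples into the two numbers we compare."""
    return {"p95_ms": p95(latencies_ms), "error_rate": error_rate(status_codes)}


def within_limits(summary, max_p95_ms, max_error_rate):
    """True if the observed metrics stay inside the limits you agreed on."""
    return (summary["p95_ms"] <= max_p95_ms
            and summary["error_rate"] <= max_error_rate)


if __name__ == "__main__":
    # Made-up samples standing in for data from your monitoring tool.
    baseline = summarize(
        [120, 130, 110, 140, 125, 135, 115, 128, 132, 138],
        [200] * 10,
    )
    during = summarize(
        [150, 480, 620, 170, 160, 900, 155, 165, 158, 162],
        [200, 200, 503, 200, 200, 503, 200, 200, 200, 200],
    )
    for name, summary in (("baseline", baseline), ("during experiment", during)):
        ok = within_limits(summary, max_p95_ms=500, max_error_rate=0.01)
        print(name, summary, "within limits" if ok else "limits exceeded")
```

Without the baseline row you would have nothing to compare the experiment against, which is the point of the Honeycomb quote above.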

Chaos Engineering User Test Cases

There are many user test cases for this technique, but these are the ones you’re most likely to encounter in the real world:

• Outage reproduction

• On-call training

• Strengthen new products

• Battle test new infrastructure and services

• Logs, disk failure

• Prepare for launches or high traffic days

Tools to Use for Chaos Engineering

Getting started with Chaos Engineering can be quite complicated (and a little scary). But once you’ve been doing it for a while, you’ll become quite good at it, and you’ll no longer be afraid to run Chaos Engineering attacks. You’ll understand that it’s a scientific process.

As I mentioned earlier, you can use Chaos Monkey, which is a great option.

But if you want a more intuitive option with a friendly UI as well as an API that lets you programmatically run all of your Chaos Engineering experiments (including disaster recovery testing), then definitely check out Gremlin Free.

For a good “getting started” guide/demo, check out Ana Medina’s 2019 PerfGuild session on the subject.

One of the best things about PerfGuild is that if you missed it, you can still get the recordings of the live event now and start binge watching. Bonus: you’ll get to view Ana’s Q&A session on Chaos Engineering.

2 comments

Interesting that the “from chaos comes order” cliche also applies to software & IT development. This kind of test should be a standard to everything as programs, apps, etc. are always created with perfect-world scenarios in mind.

Copyright 2019 by Joe Colantonio | TestTalks. Disclaimer: All the contents of the Blog, EXCEPT FOR COMMENTS, constitute the opinion of the Author and the Author alone; they do not represent the views and opinions of the Author’s employers, supervisors, nor do they represent the view of organizations, businesses or institutions the Author is a part of.