Netflix unleashes Chaos Monkey

Netflix continues its open source pledge, bringing an AWS drill sergeant to make sure your infrastructure is ready for a blackout.

Over the past few months, video streaming behemoth Netflix has
been giving back to the developer community, open sourcing a number
of interesting tools, including two Cassandra helpers, Astyanax and Priam.

Now the technical team says it’s the right time to uncage the leader
of its so-called ‘Simian Army’, Chaos Monkey, a veteran that tests
how resilient the company’s Amazon Web Services infrastructure is
during times of strife.

Available through GitHub, Chaos Monkey hunts down groups of systems
and randomly terminates virtual machine instances within them, to
simulate what would happen in a disaster scenario. Writing in the
Netflix technical blog, Cory Bennett and Ariel Tseitlin tell us why
Chaos Monkey should be present in your architecture:

We have found that the best defense against major unexpected
failures is to fail often. By frequently causing failures, we force
our services to be built in a way that is more resilient.

Failures happen and they inevitably happen when least desired or
expected. If your application can’t tolerate an instance failure
would you rather find out by being paged at 3am or when you’re in
the office and have had your morning coffee? Even if you are
confident that your architecture can tolerate an instance failure,
are you sure it will still be able to next week? How about next
month? Software is complex and dynamic and that “simple fix” you
put in place last week could have undesired consequences.
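
To make the idea concrete, here is a minimal in-memory sketch of random
instance termination across groups of systems. It is not Chaos Monkey's
actual code or API and it never touches AWS: the names `pick_victim` and
`run_chaos_round`, the group names, the instance IDs and the
`probability` parameter are all hypothetical, chosen purely for
illustration.

```python
import random
from typing import Dict, List, Optional


def pick_victim(instances: List[str], probability: float,
                rng: random.Random) -> Optional[str]:
    """Return one instance ID to terminate, or None if the group is spared.

    `probability` is a hypothetical knob: the chance that a group loses
    an instance in a given round.
    """
    if not instances or rng.random() >= probability:
        return None
    return rng.choice(instances)


def run_chaos_round(groups: Dict[str, List[str]], probability: float,
                    rng: random.Random) -> Dict[str, str]:
    """Simulate one round: at most one instance is removed per group.

    Only the in-memory lists are modified; a real tool would call the
    cloud provider's API at this point instead.
    """
    terminated: Dict[str, str] = {}
    for name, instances in groups.items():
        victim = pick_victim(instances, probability, rng)
        if victim is not None:
            instances.remove(victim)
            terminated[name] = victim
    return terminated


if __name__ == "__main__":
    # Made-up groups and instance IDs, for illustration only.
    groups = {
        "api": ["i-001", "i-002", "i-003"],
        "web": ["i-101", "i-102"],
    }
    print(run_chaos_round(groups, probability=0.5, rng=random.Random()))
```

Run against a test environment during office hours, a loop like this is
what lets you find out whether the surviving instances pick up the load
before a real failure forces the question at 3am.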

Whilst Chaos Monkey might sound like some sort of mercenary taking
out your infrastructure, the benefits of having it there are
numerous. It keeps engineers responsive and alert for the moments
when it’s all hands to the pump, and it confirms that your
infrastructure can cope and has a battle plan for Code Red scenarios.

If anyone is well placed to deal with situations like this, it’s
Netflix, with its massive cloud infrastructure. Crucially, as
scrutiny of Amazon Web Services grows with each new blackout, it
always seems to be Netflix that comes out unscathed, no doubt in
part due to its Simian Army, a ruthless squadron of tools that puts
the Netflix architecture through its paces.

The stats are impressive too:

There are many failure scenarios that Chaos Monkey helps us
detect. Over the last year Chaos Monkey has terminated over 65,000
instances running in our production and testing environments. Most
of the time nobody notices, but we continue to find surprises
caused by Chaos Monkey which allows us to isolate and resolve them
so they don’t happen again.

With some tweaking via Chaos Monkey’s REST API, you can set up a
testing scheme to make sure you aren’t caught out. The great news is
that Netflix intends to release the rest of its army in due course,
including environment tidier Janitor Monkey and something the team
calls Chaos Gorilla, which simulates an entire AWS zone outage. Get
this tool now to make sure AWS doesn’t make a monkey out of you.