Failure is the last thing you want when running a huge network, particularly one that supports a multibillion-dollar business. But preventing failure requires practice and good planning—and that's why Netflix developed software that attacks its own network more than 1,000 times a week.

By forcing Netflix engineers to recover from small failures that customers won't notice, the company hopes to prevent major outages in its video streaming service. Netflix calls the software it built to automate the process of causing failure a "Chaos Monkey," and today announced the release of Chaos Monkey's source code onto GitHub under the Apache License.

"We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient," Netflix engineer Cory Bennett and executive Ariel Tseitlin wrote in the Netflix tech blog today.

Like many businesses, Netflix hosts its infrastructure on the Amazon Web Services cloud. This allows companies to build out huge clusters of servers and storage without operating their own data centers, but it doesn't insulate them from failure. Businesses that run infrastructure on Amazon have to think about what happens both when Amazon services suffer outages and when their own software causes downtime.

Netflix's Chaos Monkey is "a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact," Netflix explained. "The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables—all the while we continue serving our customers without interruption."

Specifically, the Chaos Monkey randomly terminates virtual machines Netflix operates in Amazon's Auto Scaling service. In the past year, Netflix says its Chaos Monkey "has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don't happen again."
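In outline, the monkey's core loop is short enough to sketch. Here's a minimal, illustrative Python version using the boto3 AWS SDK; it is not Netflix's released code, and the Auto Scaling group name is made up:

    import random

    import boto3  # AWS SDK for Python

    autoscaling = boto3.client("autoscaling")
    ec2 = boto3.client("ec2")

    def pick_random_instance(group_name):
        """Return the ID of a random instance in an Auto Scaling group."""
        groups = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=[group_name]
        )["AutoScalingGroups"]
        instances = [i["InstanceId"] for g in groups for i in g["Instances"]]
        return random.choice(instances) if instances else None

    def unleash_monkey(group_name):
        """Terminate one random instance; Auto Scaling should replace it."""
        victim = pick_random_instance(group_name)
        if victim is not None:
            ec2.terminate_instances(InstanceIds=[victim])
            print("Chaos Monkey terminated " + victim)

    unleash_monkey("my-streaming-service-asg")  # hypothetical group name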

The Auto Scaling technology on Amazon's cloud should detect the termination of an instance and automatically configure a new, identical one to replace it. But the Chaos Monkey's random attacks can still suss out problems, like a patch gone wrong or a traffic load balancer that's failing to route requests around offline instances. While Netflix uses the Chaos Monkey on Amazon, it's flexible enough that it can be installed on other public cloud networks. By default, it only runs during business hours, so people are around to clean up the Chaos Monkey's mess when it identifies a serious problem.
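That business-hours restriction is easy to approximate: check the clock before doing anything destructive. A sketch, assuming a weekday 9-to-5 window in the server's local time:

    from datetime import datetime

    def monkey_may_run(now=None):
        """Allow chaos only during business hours, when engineers are around."""
        now = now or datetime.now()
        is_weekday = now.weekday() < 5    # Monday=0 .. Friday=4
        in_hours = 9 <= now.hour < 17     # assumed 9:00-17:00 window
        return is_weekday and in_hours

    if monkey_may_run():
        unleash_monkey("my-streaming-service-asg")  # from the sketch above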

Amazon's cloud infrastructure is divided into data center regions (like the East Coast or West Coast), which in turn are divided into availability zones. Customers are more likely to survive Amazon failures if they build systems that can fail over across availability zones or regions. Building across regions is the most expensive option, but also the most resilient, as failures have occurred across multiple availability zones on numerous occasions.
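In AWS terms, spreading a service across zones can be as simple as listing more than one zone when creating an Auto Scaling group. An illustrative boto3 sketch; the group, launch configuration, and zone names are placeholders:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Spread instances across two availability zones in one region, so the
    # loss of a single zone leaves capacity running in the other.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="my-streaming-service-asg",  # placeholder
        LaunchConfigurationName="my-launch-config",       # placeholder
        MinSize=2,
        MaxSize=8,
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )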

Last year, customers like reddit, Foursquare, and Quora experienced first-hand what can happen when multiple availability zones are hit with the same problem. Just last month, a power outage followed by the failure of Amazon's primary, backup, and secondary backup power systems took down many virtual machines and storage volumes in Amazon's East Coast region. And yes, even Netflix was taken offline by another outage at the end of June.

As such, Netflix's error detection efforts have to go beyond the scale of individual virtual machines. Netflix detailed its Chaos Monkey one year ago in a blog post that also revealed plans for various other chaos-inducing "monkeys." There's a Latency Monkey that introduces artificial delays into Netflix's RESTful client-server communication layer to simulate service degradation, and a Conformity Monkey that shuts down instances that don't adhere to best practices. There's even a Chaos Gorilla that acts like a Chaos Monkey but simulates an outage of an entire Amazon availability zone.
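The Latency Monkey idea translates to a few lines in any service: wrap outbound calls and, with some probability, sleep before letting them through. A sketch with made-up parameters, not Netflix's implementation:

    import random
    import time
    from functools import wraps

    def latency_monkey(probability=0.1, max_delay=2.0):
        """Randomly delay a fraction of calls to simulate a degraded dependency."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                if random.random() < probability:
                    time.sleep(random.uniform(0, max_delay))
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @latency_monkey(probability=0.25, max_delay=1.5)
    def fetch_catalog():
        """Stand-in for a REST call to another internal service."""
        return {"titles": []}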

While the Chaos Monkey is available to anyone who wants it today, there's no word yet on when or whether any of Netflix's other monkeys will be released into the wild. A posting on GitHub describes the Chaos Monkey as the "first member" of Netflix's Simian Army.

Promoted Comments

This is awesome! Most corporate recovery tests pertain only to taking down specific, pre-determined servers. These Netflix monkeys seem to have some intelligence and are more advanced at creating random havoc, keeping the infrastructure engineers and developers on guard. More importantly, these tests are performed in the production (live) environment rather than in a test environment, as some companies do in their BCP (business continuity planning) tests. Pretty gutsy!

This is a great testing concept tailored to cloud computing, too. Way to teach the other Fortune 1000 companies how to perform resiliency tests. Kudos, Netflix.

To me, what's interesting about this is not the part where they randomly take instances offline. The interesting part is that they must have embedded enough telemetry in their network to identify any anomalies that result from those take-downs.

Well they're making a monkey out of those servers. Considering some of the conversations in the Ars forums, most corporate infrastructure is just a stone's throw from failure anyway, so no monkey required there.

I wonder if a few of the outages that I have experienced on Netflix, or have seen mentioned on Netflix's twitter, could be the result of said Monkey?

Either way, this is a very smart thing to do. But just make sure you build it up from the start to be very resilient, or you will end up taking your entire service down for a significant period of time the moment you hit the 'on' switch!

@Paul Rodgers: Not really. Since they wrote the software, they can defend against it. Repurposing it into something different enough to pose an actual risk would be as much work as developing a whole new attack program, if not more.

OT: I think they should loose some real monkeys in their offices so that they are prepared to deal with "planet of the apes" disaster scenarios.

In all seriousness this actually sounds like a pretty good idea. Kudos to Netflix.

I think in addition to the code they should unleash a real monkey, one armed with a screwdriver and pair of wire cutters. In order to ensure a truly redundant cloud environment one needs to keep the hardware techs on top of their game as well.

Amazon is a little late to the party here... when paired with DPM, Microsoft Hyper-V has this feature built in!

In seriousness, I'm not trolling, I just couldn't help myself. In my environment DPM randomly knocks hosts and sometimes the entire cluster offline... my sysadmin says that this is "normal". It's good to see that Netflix takes a more proactive approach to these kinds of issues. Maybe this strategy will catch on and more companies will embrace stable infrastructure...

If you ask most software engineering teams what happens if a server goes down they'll probably start sweating a little. If they know at all. But everybody knows it's really a question of "when" not "if".

Also, we know that mistakes are cheaper to fix the earlier they are detected. So it must make sense to cause these inevitable failures as soon as possible.

I concur with sending actual monkeys to the amazon centers hosting the netflix instances. Rabid monkeys on acid. Sent in Dell boxes disguised as servers. This scenario may prove useful to Dell as well, since they may one day need to be prepared for rabid monkeys in server suits tripping balls in their boxes.

In game development, we often let daily builds soak with a "monkey" player that just randomly moves the joystick/mouse and presses buttons/keys (obviously the pad input is simulated; we don't have hardware that randomly sends input). Any crashes are turned into bug reports.

You'd be surprised how many bugs you would find by just hammering the controller. It obviously doesn't catch everything, but it catches a lot of stupid things, and catches them early in development, rather than when you're crunching 80+ hour weeks.
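Stripped down, a monkey player is just a loop of random events. A minimal Python sketch, assuming a hypothetical game API with handle_input() and tick():

    import random

    BUTTONS = ["A", "B", "X", "Y", "START", "UP", "DOWN", "LEFT", "RIGHT"]

    def monkey_player(game, steps=100_000):
        """Feed the game random simulated input; any crash becomes a bug report."""
        for step in range(steps):
            event = (random.choice(BUTTONS),
                     random.uniform(-1, 1),   # random stick x
                     random.uniform(-1, 1))   # random stick y
            try:
                game.handle_input(*event)     # hypothetical game API
                game.tick()
            except Exception as exc:
                print("bug report: step %d, input %r: %r" % (step, event, exc))
                raise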

Good to see this sort of thing happening in the server space. Even better to see software available to help automate it for others.

The game-dev monkey is a different concept, though: fuzz testing. Pretty useful for finding bugs, and not only in game development.
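Fuzz testing is easy to demonstrate: generate random input, hand it to the code under test, and flag anything other than a clean rejection. A minimal sketch against Python's own JSON parser:

    import json
    import os
    import random

    def fuzz_json_parser(iterations=10_000):
        """Feed random bytes to the parser; a ValueError means the garbage was
        rejected cleanly, while any other exception counts as a bug."""
        for _ in range(iterations):
            blob = os.urandom(random.randint(1, 64))
            try:
                json.loads(blob)
            except ValueError:
                pass  # rejecting malformed input is correct behavior
            except Exception as exc:
                print("bug: input %r raised %r" % (blob, exc))

    fuzz_json_parser()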

Am I the only one here that thinks that this could lead to some really bad stuff down the line?

What bad stuff do you think this would lead to?

About the worst thing that could happen is someone introduces a barrel of Chaos Monkeys into their production AWS system without thinking terribly hard about it, causing an outage of epic proportions for their application.

I'd let them loose in a test environment first, just to make sure they don't cause a complete outage.
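Another way to hedge that risk is to start with a dry-run mode that only logs what the monkey would have killed. A sketch building on the terminate loop in the article above; the dry_run flag is an assumption for illustration, not a documented Chaos Monkey option:

    def unleash_monkey_safely(group_name, dry_run=True):
        """In dry-run mode, log the would-be victim instead of terminating it."""
        victim = pick_random_instance(group_name)  # from the earlier sketch
        if victim is None:
            return
        if dry_run:
            print("[dry run] would have terminated " + victim)
        else:
            ec2.terminate_instances(InstanceIds=[victim])  # real chaos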

Is it just the D&D player in me that thinks Chaotic Monkeys are awesome?

The original Macintosh had a MonkeyLives system variable. It let the system know that the Monkey desktop accessory was running and spamming random events. The idea was that applications shouldn't let the monkey's random input quit the program being fuzz tested.