2012-07-31

Welcome to Chaos

Netflix have published their "Chaos Monkey" code on GitHub, ASL licensed. Having looked through the code, I have already filed my first issue -one that is already marked as fixed.

Netflix bring to the world the original Chaos Monkey -tested against production services.

Those of us playing with failures, reliability and availability in the Hadoop world also need something that can generate failures, though for testing the needs are slightly different:

1. Failures triggered somewhat repeatably.

2. Be more aggressive.

3. Support more back ends than Amazon -desktop, physical, private IaaS infrastructures.

#1 and #2 are config tuning -faster killing, seeded execution.
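As a minimal sketch of what "seeded execution" could look like: pick victims with a random number generator built from a fixed seed, so the same seed and the same instance list always give the same kill sequence. The class name `SeededVictimPicker` and its methods are made up for illustration; this is not how the Netflix code is structured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: chooses victims from a fixed list using a seeded RNG. */
class SeededVictimPicker {
    private final List<String> instanceIds;
    private final Random random;

    SeededVictimPicker(List<String> instanceIds, long seed) {
        if (instanceIds.isEmpty()) {
            throw new IllegalArgumentException("no instances to choose from");
        }
        // Copy the list so later changes by the caller cannot alter the sequence.
        this.instanceIds = new ArrayList<>(instanceIds);
        this.random = new Random(seed);
    }

    /** Returns the next victim; the sequence depends only on the seed and the list. */
    String nextVictim() {
        return instanceIds.get(random.nextInt(instanceIds.size()));
    }
}
```

Two pickers created with the same seed replay the same sequence of victims, which is what makes a failing test run reproducible. "Faster killing" would then just be a shorter interval between calls to `nextVictim()`.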

#3? Needs more back ends. The nice thing here is that there's very little you need to implement when all you are doing is talking to an Infrastructure Service to kill machines; the CloudClient interface has one method:

void terminateInstance(String instanceId);

That needs to be supplemented with something to produce a list of instances, and of course there's the per-infrastructure configuration of URLs and authentication.
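To show how little a new back end has to do, here is a sketch of a fake, in-memory one. The `CloudClient` interface below is a local stand-in with the one method quoted above; `InstanceSource` and `InMemoryCloud` are hypothetical names I have invented for the "list of instances" part, not anything in the Netflix code.

```java
import java.util.ArrayList;
import java.util.List;

/** Local stand-in for the one-method interface quoted in the post. */
interface CloudClient {
    void terminateInstance(String instanceId);
}

/** Hypothetical companion: something that can enumerate candidate instances. */
interface InstanceSource {
    List<String> listInstances();
}

/** Hypothetical back end that only tracks state in memory; useful for dry runs. */
class InMemoryCloud implements CloudClient, InstanceSource {
    private final List<String> live;
    private final List<String> terminated = new ArrayList<>();

    InMemoryCloud(List<String> instanceIds) {
        this.live = new ArrayList<>(instanceIds);
    }

    @Override
    public List<String> listInstances() {
        // Return a copy so callers cannot modify the internal state.
        return new ArrayList<>(live);
    }

    @Override
    public void terminateInstance(String instanceId) {
        // List.remove(Object) returns false when the id was not present.
        if (!live.remove(instanceId)) {
            throw new IllegalArgumentException("unknown instance: " + instanceId);
        }
        terminated.add(instanceId);
    }

    /** Ids that have been "killed" so far, in order. */
    List<String> terminatedInstances() {
        return new ArrayList<>(terminated);
    }
}
```

A real back end would replace the list manipulation with a call to the infrastructure's API, using whatever URL and credentials its configuration supplies.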

Pop up a dialog telling the user to kill a machine (not so daft, good for semi-automated testing).

Issue VirtualBox commands to kill a VM.

All of these are fairly straightforward to migrate to the Chaos Monkey; they are all driven by config files enumerating the list of target machines, plus some back-end specific options (e.g. pid file locations, list of vbox UUIDs).
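For the VirtualBox case, a config-driven back end could look roughly like the sketch below. It assumes the `VBoxManage` command-line tool is on the path and uses its `controlvm <vm> poweroff` subcommand for a hard power-off. The property key `chaos.vbox.uuids` and the class name `VirtualBoxTerminator` are invented for this example. The `terminateInstance` method has the same shape as the `CloudClient` method, so wiring it in as an implementation should be a small step.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical back end that powers off VirtualBox VMs listed in a config file. */
class VirtualBoxTerminator {
    /** Made-up config key: a comma-separated list of VM UUIDs. */
    static final String UUIDS_KEY = "chaos.vbox.uuids";

    private final List<String> uuids = new ArrayList<>();

    VirtualBoxTerminator(Properties config) {
        for (String part : config.getProperty(UUIDS_KEY, "").split(",")) {
            String uuid = part.trim();
            if (!uuid.isEmpty()) {
                uuids.add(uuid);
            }
        }
    }

    /** The configured target VMs. */
    List<String> targets() {
        return new ArrayList<>(uuids);
    }

    /** Builds the command line; only configured targets are accepted. */
    List<String> commandFor(String uuid) {
        if (!uuids.contains(uuid)) {
            throw new IllegalArgumentException("not a configured target: " + uuid);
        }
        return Arrays.asList("VBoxManage", "controlvm", uuid, "poweroff");
    }

    /** Runs the command and fails if VBoxManage reports a non-zero exit code. */
    void terminateInstance(String uuid) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(commandFor(uuid)).inheritIO().start();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("VBoxManage exited with " + exit);
        }
    }
}
```

Keeping the command construction separate from its execution means the target list and command line can be checked without a VirtualBox install to hand.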

Then there are the other possibilities: VMware, fencing devices on the LAN, ssh in and issue "ifup/ifdown" commands (though note that some infrastructures, such as vSphere, recognise that explicit option and take things off HA monitoring). All relatively straightforward.

Which means: we can use the Chaos Monkey as a foundation for testing how distributed systems, especially the Hadoop stack components, react to machine failover -across a broad set of virtual and physical infrastructures.