Release the Monkeys: Testing Using the Netflix Simian Army

The cloud is all about redundancy and fault tolerance. Since no single component can guarantee 100 percent uptime, we have to design architectures in which individual components can fail without affecting the availability of the entire system. But designing a fault-tolerant architecture is not enough. We have to constantly test our ability to actually survive these “once in a blue moon” failures, and the best way to do so is to test in an environment that matches production as closely as possible or, ideally, in production itself. This is the philosophy behind Netflix's Simian Army, a group of tools that randomly induce failures in individual components to make sure that the overall system can survive. Gareth Bowles introduces the main members of the Simian Army: Chaos Monkey, Latency Monkey, and Conformity Monkey. Gareth provides practical examples of how to use them in your test process and, if you're brave enough, in production.

Gareth Bowles started out as a developer and later graduated to breaking other people's software instead of his own, before realizing that his real passion is for shipping product faster, cheaper, and more reliably, while still getting a good night's sleep. Gareth has practiced and managed quality engineering and technical operations at Silicon Valley companies ranging from six-person startups to major industry players.