Jeff Atwood recently wrote a blog post about Netflix's implementation of a "Chaos Monkey". It is a very high-level article. I am curious whether anyone has actually implemented this technique for testing a system.

I guess what I'm really trying to ask is: What strategies do you implement to ensure your architecture can survive part of the system crashing?

1 Answer

Isolation and graceful degradation are the general strategies. (A related term you might see is decoupling, though I tend to see it used at a smaller scale, such as in object-oriented design and programming. The concept is the same.)

You isolate different parts of a system from each other so that if one is down, the others can still respond to requests. Like the Netflix blog said, if searching wasn't working, streaming would still be fine. This just means that searching and streaming were separated enough that a bottleneck or incapacitation of one did not affect the other.
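To make the isolation idea concrete, here is a minimal sketch in Python. It only illustrates error containment inside one process; real isolation usually also involves separate processes or hosts, timeouts, and independent resource pools. The names `call_isolated`, `search`, and `streaming` are hypothetical stand-ins, not anything from Netflix's system.

```python
from typing import Callable, Dict


def call_isolated(services: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Call every service inside its own failure boundary.

    An exception raised by one service is recorded against that service
    only, so the remaining services still get a chance to respond.
    """
    results: Dict[str, str] = {}
    for name, service in services.items():
        try:
            results[name] = service()
        except Exception as exc:
            # Contain the failure: report it, but keep serving the others.
            results[name] = f"unavailable ({type(exc).__name__})"
    return results


# Hypothetical subsystems: search is broken, streaming is healthy.
def search() -> str:
    raise TimeoutError("search backend timed out")


def streaming() -> str:
    return "stream ready"


if __name__ == "__main__":
    print(call_isolated({"search": search, "streaming": streaming}))
```

The point of the sketch is that the caller never lets one subsystem's exception propagate far enough to take down the response for the others.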

With graceful degradation, if the best implementation of something is not available, you have something else fill in. Again from the Netflix post, they have a system for looking at the things you've watched and liked and then working out personalized recommendations of other things to watch. If that system is down, they fall back to showing recommendations of things that are popular overall. The point is to have a Plan B, Plan C, etc. to do or show something when Plan A fails rather than showing nothing or an error.
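The Plan A / Plan B idea can be sketched as an ordered list of fallbacks. This is only an illustration under stated assumptions: `first_available`, `personalized_recommendations`, and `popular_titles` are hypothetical names, and the simulated outage stands in for whatever failure the real recommendation service might have.

```python
from typing import Callable, List, Sequence


def first_available(plans: Sequence[Callable[[], List[str]]]) -> List[str]:
    """Try each plan in order and return the first result that succeeds.

    If every plan fails, return an empty list so the caller can still
    render the page instead of showing an error.
    """
    for plan in plans:
        try:
            return plan()
        except Exception:
            # This plan is unavailable; fall through to the next one.
            continue
    return []


# Plan A (hypothetical): personalized results, simulated here as being down.
def personalized_recommendations() -> List[str]:
    raise ConnectionError("recommendation service is down")


# Plan B (hypothetical): overall popular titles, cheap and rarely unavailable.
def popular_titles() -> List[str]:
    return ["Popular Title A", "Popular Title B"]


if __name__ == "__main__":
    print(first_available([personalized_recommendations, popular_titles]))
```

Ordering the plans from best to most basic keeps the degradation policy in one place, so adding a Plan C is just another entry in the list.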

One commonly cited client-side example of graceful degradation (whether or not it is commonly implemented) involves the use of JavaScript on websites. If JavaScript is disabled in the browser or simply unavailable, the site's pages should still work without it. The site may not be as fast or slick, but it should still function rather than become unusable.

These are very general ideas, though. Just about every project would implement them differently, depending on the services and subsystems they provide, and the dependencies between them.