Should we design death into our programs, processes, and threads at a low level, for the good of the overall system?

Failures happen. Processes die. We plan for disaster and occasionally recover from it. But we rarely design and implement unpredictable program death. We hope that our services' uptimes are as long as we care to keep them running.

A macro-example of this concept is Netflix's Chaos Monkey, which randomly terminates AWS instances in some scenarios. They claim that this has helped them discover problems and build more redundant systems.

What I'm talking about is lower level. The idea is for traditionally long-running processes to randomly exit. This should force redundancy into the design and ultimately produce more resilient systems.

Does this concept already have a name? Is it already being used in the industry?

Avoid at all costs

We should design proper bad-path handling and design test cases (and other process improvements) to validate that programs handle these exceptional conditions well. Stuff like Chaos Monkey can be part of that, but as soon as you make "must randomly crash" a requirement, actual random crashes become things testers cannot file as bugs.

No exit (code)

Adding random exit code to the application should not be necessary. Testers can write scripts which randomly kill the application's processes.
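
For illustration, here is a rough sketch of what such an external kill script might look like, in Python, assuming a Linux-like box with pgrep available; the process name "myservice" and every parameter are placeholders:

    #!/usr/bin/env python3
    # Hypothetical external test harness: kill a random instance of a target
    # process at random intervals. It lives entirely outside the application
    # under test, so no "random death" code ships in the product itself.
    import os
    import random
    import signal
    import subprocess
    import time

    TARGET = "myservice"   # made-up name; substitute the daemon under test

    while True:
        # Wait anywhere from one to thirty minutes before the next kill.
        time.sleep(random.uniform(60, 1800))
        # Find the running instances of the target process.
        result = subprocess.run(["pgrep", "-x", TARGET],
                                capture_output=True, text=True)
        pids = [int(p) for p in result.stdout.split()]
        if not pids:
            continue
        victim = random.choice(pids)
        print(f"killing pid {victim}")
        # SIGKILL rather than SIGTERM: a real crash is abrupt and unhandleable.
        os.kill(victim, signal.SIGKILL)

Because it runs outside the application, the same script can be pointed at any service, and it is trivially excluded from production.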

In networking, it is necessary to simulate an unreliable network for the sake of testing a protocol implementation. This does not get built into the protocol; it can be simulated at the device driver level, or with some external hardware.

Don't add test code to the program for situations that can be achieved externally.

If this is intended for production, I can't believe it's serious!

Firstly, unless the processes exit abruptly, so that in-progress transactions and volatile data are lost, it's not an honest implementation of the concept.

Planned, graceful exits, even if randomly timed, do not adequately help prepare the architecture for dealing with real crashes, which are not graceful.

If real or realistic malfunctions are built into the application they could result in economic harm, just like real malfunctions. Purposeful economic harm is basically a criminal act, almost by definition.

You may be able to get away with clauses in the licensing agreement which waive civil liability from any damages arising from the operation of the software, but if those damages are by design, you might not be able to waive criminal liability.

Don't even think about stunts like this: make it work as reliably as you can and put fake failure scenarios only into special builds or configurations.

The code that cried wolf

The problem I see is that if such a program dies, we'll just say "Oh it's just another random termination—nothing to worry about." But what if there is a real problem that needs fixing? It will get ignored.

Programs already "randomly" fail due to developers making mistakes, bugs making it into production systems, hardware failures, etc. When this does occur, we want to know about it so we can fix it. Designing death into programs only increases the probability of failure and would only force us to increase redundancy, which costs money.

I see nothing wrong with killing processes randomly in a test environment when testing a redundant system (this should be happening more than it is), but not in a production environment. Would we pull a couple of hard drives out of a live production system every few days, or deactivate one of the computers on an aircraft as it is flying full of passengers? In a testing scenario, fine. In a live production scenario, I'd rather not.

42 Reader Comments

It's a novel design, but I think it's a bit of a leap to force developers into having to code for the possibility that your program might die at any time.

Programmers should be considering program interruption at all times, though deadline constraints make this one of many areas that can suffer or be forgotten entirely. Even if your code is going to run in a data-centre you have no guarantee there won't be a power loss, or that an uncaught memory leak will cause your program to be terminated, so it's worth designing parts of your program so that they can fail gracefully.

The most common example is saving files; many (probably most) programs overwrite files directly, but this is a horrible practice, because if the save operation is unexpectedly terminated halfway through, you end up with a corrupted file. Provided you have free space available, you should always save files to a temporary location and then move them into place. Fortunately some APIs enforce this behaviour, but it's a good example of the kind of thing you need to consider.
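
As a sketch of that pattern (not tied to any particular API), writing to a temporary file in the same directory and then renaming it over the target keeps the old file intact if the process dies mid-save:

    import os
    import tempfile

    def atomic_save(path: str, data: bytes) -> None:
        # Write to a temp file in the target's directory, then atomically
        # replace the target. A crash mid-write leaves the old file intact.
        dir_name = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=dir_name)
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(data)
                tmp.flush()
                os.fsync(tmp.fileno())   # make sure the bytes reach the disk
            os.replace(tmp_path, path)   # atomic rename on POSIX and modern Windows
        except BaseException:
            os.unlink(tmp_path)          # don't leave the temp file behind
            raise

The temp file must live on the same filesystem as the target, otherwise the rename stops being atomic.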

Use of transactional APIs is another good one; it can be a huge pain in the ass to code everything to use transactions, but if you get into the habit of doing it for all major data manipulation then it can result in a far more resilient system overall.
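
For example, with Python's built-in sqlite3 module (the table and values here are invented), using the connection as a context manager wraps a group of statements in a single transaction:

    import sqlite3

    conn = sqlite3.connect("example.db")   # hypothetical database

    # Either both updates are committed together, or, if an exception is
    # raised or the process dies partway through, neither change survives.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (100, 1))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (100, 2))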

I wish I could think of the name of it, but there's a software engineering methodology like this. Basically you introduce "X" bugs into the system and then make sure all "Y" failures are directly attributable to the known bad. But it's a development and testing process, not something you leave in for the real world.

Part of what I do involves instrumenting code in Dev AND Prod for clients. Application systems are so complex that pretty much anything beyond "Hello, world" is hard to fully understand.

What I mean is that in EVERY engagement we do, there comes one or more moments where the App team says "we didn't know it did THAT, or hooked THIS, or connected to THERE" with stunned looks on the faces in the room.

Basically, there are untold numbers of chaos monkeys, apes, and gibbons already in production environments, and they sling shit without any additional help.

I dunno, I'm no programmer but I could see this being useful (or even necessary) if you need very robust code, like military or NASA grade. If you're programming for a space probe, the circuits are probably going to get hit with cosmic radiation (no shielding is perfect), so you're going to get some bits that are truly flipped randomly. What happens then? End of mission.

If you design that into your code, it could help you create software with tons of redundancy and error-checking and careful assertions everywhere. And if you can track where the "random death" strikes, you can find code that's not yet hardy enough to withstand a vigorous blast of radiation. That way you can have it recover gracefully (and not lose your Mars rover).

I'm guessing it would also make your code a whole lot longer, and a lot slower.

Program a multi-threaded server to spawn a separate service thread or process for each client, request or session, preferably with the latter having minimum privileges. Make the spawner process or thread long lived and do your longevity engineering on the design/implementation of that, and put the complexity into the service thread. If the service thread has a very occasional memory leak or other difficult to replicate error, you still want to know why as it may be a security issue, but that's less likely to be a problem causing loss of service as it's short lived compared to the same fault in the longer lived spawner process.
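
A rough sketch of that split in Python, assuming a POSIX-like system; the work done per client is a stand-in:

    import multiprocessing as mp
    import socket

    def handle_client(client):
        # Short-lived worker: the complex, leak-prone logic lives here. If it
        # crashes or leaks, only this one connection is affected.
        with client:
            data = client.recv(4096)
            client.sendall(data.upper())   # placeholder for the real work

    def serve(host="127.0.0.1", port=9000):
        # Long-lived spawner: kept as small and boring as possible.
        with socket.socket() as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind((host, port))
            srv.listen()
            while True:
                client, _ = srv.accept()
                mp.Process(target=handle_client, args=(client,), daemon=True).start()
                client.close()             # close only the parent's copy

    if __name__ == "__main__":
        serve()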

For programming in fault-intolerant environments such as military/avionics/medicine, it's usually a good idea to code in a language such as Ada or SPARK. These languages have a lot of features (strong typing, robust exception handling, etc.) that allow the compiler to do a huge amount of error checking before the code even runs. And because they are compiled languages, they also run much faster than interpreted ones. You should never use an interpreted language (I'm looking at you, Java) for mission-critical applications.

Furthermore, it's a horrible idea to have "random death" in production code. In dev code, that's another story. For example I was working on a fault tolerant virtual file system that was stream agnostic; one of the things I did was implement a BorkedStreamWrapper class, that dropped data, lagged, and all around misbehaved, so I could reliably test the fault tolerance. But things like that, or Netflix's Chaos Monkey are fault testing tools. If you're not looking for data to improve fault tolerance, you shouldn't be introducing faults on purpose.
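
Something in the spirit of that wrapper (a guess at the idea, not the actual class) can be put together in a few lines; in tests it stands in for the real stream so the fault-handling paths get exercised reproducibly:

    import random
    import time

    class FlakyStreamWrapper:
        # Test-only wrapper around a file-like object that misbehaves on
        # purpose: it randomly raises, delays, or truncates reads.
        def __init__(self, stream, error_rate=0.05, lag_rate=0.1, seed=None):
            self._stream = stream
            self._error_rate = error_rate
            self._lag_rate = lag_rate
            self._rng = random.Random(seed)   # seedable, so failures replay

        def read(self, size=-1):
            if self._rng.random() < self._error_rate:
                raise IOError("injected fault: simulated read failure")
            if self._rng.random() < self._lag_rate:
                time.sleep(self._rng.uniform(0.1, 1.0))   # simulated lag
            data = self._stream.read(size)
            if data and self._rng.random() < self._error_rate:
                return data[: len(data) // 2]             # simulated truncation
            return data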

Mission critical applications are an alien world to most folks used to web development and the like. Concepts like deterministic runtime and centralized/paired device hooks are pretty easy to manage, but you're unlikely to run across them in a standard computer science curriculum - or in application development outside of embedded systems.

However, it's worth noting that no matter how well you manage your own code, you're still at the mercy of external APIs. In the embedded world, you either write the APIs yourself or pull them from a validated library (and you have a much, much smaller set of APIs since you only use what you need). In web development, probably 90%+ of what you're doing is someone else's code - and chances are, it's someone else's rather sloppy code.

In the real world, daemons are written so that they kill themselves and respawn. This is to deal with unclosed file descriptors, unused memory, and other resources that are not properly returned to the system. (Sometimes the system won't allow it.) There ain't much to debate; no program should run forever. Those that do have their own death built into them.
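
As a sketch of that respawn pattern (the same idea as Apache's MaxRequestsPerChild or gunicorn's max_requests), the worker below exits on purpose after a bounded amount of work and a dumb supervisor loop restarts it, so slow leaks never get the chance to accumulate; the job handler is a placeholder:

    import multiprocessing as mp

    MAX_JOBS_PER_WORKER = 1000   # recycle the worker before leaks pile up

    def handle(job):
        pass                      # placeholder for the real work

    def worker(jobs):
        # Process a bounded number of jobs, then exit deliberately.
        for _ in range(MAX_JOBS_PER_WORKER):
            job = jobs.get()
            if job is None:
                return
            handle(job)

    def supervise(jobs):
        # Dumb supervisor: whenever the worker exits, planned or not,
        # start a fresh one with a clean slate of memory and descriptors.
        while True:
            p = mp.Process(target=worker, args=(jobs,))
            p.start()
            p.join()

    if __name__ == "__main__":
        supervise(mp.Queue())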

This is a ridiculous question. Nobody writes algorithms which shut themselves down randomly. The most important principle of software engineering is separation of concerns. I think a fairer question is whether programs should be routinely fuzz-tested (abrupt shutdown is one such test case) as part of a QA cycle.

I think that fuzz testing in a production environment is stupid... it adds additional entropy in a system that's supposed to be stable. Ultimately a decision like this comes down to two things: cost of failure and control of environment. Do I want to have random processes and programs shutting down while I am flying in an airplane? In a nuclear power plant? How about on a Mars rover? On my smartphone while I am talking to somebody? Of course not.

Do I care if programs shut down abruptly on Netflix? Netflix provides entertainment, I won't die if I cannot stream movies or tv shows; the cost of failure is low.

Netflix's software also runs in a hosted environment which is constantly under their control. I would never recommend doing something like this for software which runs on a remote (inaccessible) environment.

From the other side of the product life: would anyone even buy this if it is designed to fail at random? You wouldn't buy a car that broke down at random, or a fridge that turned off randomly. You would probably sue for damages if your car did that and caused you to crash. A crashing program can be just as harmful, especially if lives depend on it. Would you go to a hospital if you knew its software crashed randomly?

Programs should be built to handle any situation, but building in more errors simply leads to more overhead in testing and to legal and sales issues. No sane programmer should ever do this if they value their career.

You can do this with test cases, when you want to test the handling of an error. You can then trigger a situation that causes an error, to test if the code handles this right. I think a better way is to create error-handling that covers all known errors, and attempts to catch any unknown errors as best as possible. But building this into real world programs as "redundancy"? Very bad idea.

When developing, use unit tests and fuzz testing. You should never release a product designed to fail or to misbehave on purpose. The only reason to deliberately crash a program is bug testing and refining error handling, and that is done while developing the product for the consumer, not after. The consumer will only get irritated or angry, or, as some people have mentioned, the result will end in a catastrophe. The reason Netflix gets away with Chaos Monkey is that they are the developers, they only do it at the macro level, and they probably use unit tests and fuzz testing before deployment.

I think someone here understood what the Chaos Monkey was for and what it should achieve. If you have a server farm with way above 99% uptime, which AWS should be, you basically never really test your safety nets. It's like the fire extinguisher in my basement, which is never used unless my house actually catches fire. That fire extinguisher is now well tested, at least I hope so, to make sure it works when it's needed. But how do you do that with distributed apps? Yes, that was the idea of Chaos Monkey: tip the handle slightly, releasing a tiny amount of the powder, to make sure the fire extinguisher works and nothing is blocked or has gotten old.

Now, the idea of building this kind of test into the program itself has two problems. One is the isolation between product and test. You ideally want the test situation to be as real (sometimes as severe) as possible, so to avoid the blindness of making the disaster scenario exactly what you foresaw when designing the safeguard, you make the testing external to your product and let someone else design it. The other problem is that you don't want the negative effects of tests during actual use. What would you say if the fire extinguisher in your basement leaked a little bit of its powder every weekend? How about some actuator and microcontroller that pseudo-manually pushes the handle a bit from time to time?

I am all for designing software to be robust. During testing, all kinds of weird scenarios, abuse, and poking around are fine, until you can't find any possible scenario that kills your software or results in data loss, delays, etc. But what do you gain from subjecting actual users to the inconvenience of crashing software?

You'll want to test redundancy. That can be done via a manual failover, which should always be an option for events like planned maintenance. This will need to be tested extensively, as it is what production should be using for redundancy testing. That isn't quite the scope of the question, though; it's more of a logical exception.

Issues like application crashes, network disruptions, and hardware death can be simulated from outside the application, and thus should be. It keeps the core application code focused on its purpose. Testing this should be isolated to non-production environments like dev or test, so any issues in the staging or production environments can be attributed to misbehavior in the application code itself and not to it 'working as designed'.

Chaos Monkey is a nice tool for testing an application. The question then becomes: should I run this in production? The answer is not only no but, in many cases, hell no. Netflix can get away with Chaos Monkey in prod because of the service it offers: video streaming. Ideally the only user-noticeable impact would be a hiccup in video playback, and the user likely forgets about it once the video begins playing again. Anyone willing to try Chaos Monkey in a prod environment for, say, a medical application deserves to be fired. Yes, Chaos Monkey should be able to play in prod, but what if it finds an issue that the application doesn't handle gracefully? The use case has changed from Netflix and delayed streaming video to some medical instrument failing and, worst case, someone dying. I would not want to be the one explaining to the FDA that your software was working as designed when it unplugged the patient.

Either I misinterpreted the question, or everybody here is missing the point. If a system is designed to handle losing programs/processes/instances at random, it has to have some major fault tolerance and data duplication built in. People are complaining that it would cover up bugs and whatnot. That would be the point - a process terminating due to a bug would look the same as a process terminating at random, and be handled in the same way. So the bugs don't matter as much as normal.

It might be a bad idea, but you guys are commenting as if the whole system is meant to fall over.

I think everyone missed the point. Programs are never perfect and there are always more bugs. So instead of trying to chase them forever you make a program that can handle it without disruption.

I'd rather do both.

Nobody is saying you shouldn't try to make your application more fault tolerant. Fault injection has its place in the dev/test cycle, certainly, but introducing random crashes in production environments will only inconvenience users, at best, or bring down the whole system, at worst.

The point of things like Netflix's Chaos Monkey is that they provide data on how faults were handled, so that the handling can be improved upon. They don't just go shutting down servers at random for the hell of it. If you're not seeking to improve fault tolerance (read: testing), then why would you be introducing faults on purpose? Yep, 2+2 still equals 4. That is to say, doing so only reaffirms things that you already knew, things that would not change.

No fault, however well handled, is entirely without cost. So why would you incur that cost if you're not getting something in return?

Cells in our stomach die daily and are continuously being replaced. However, fingers, or teeth or eyes do not regrow because carrying the genetic code to regrow them would be too expensive.

Translation to software: plan for failure where you think failure is most likely but beyond that, there isn't enough money in the world to design 100% resilient software.

Too bad you were downvoted, I agree completely.

If you have a large enough engineering budget to make things work with random crashes, then this type of thing is a good idea (although I prefer the idea of a daemon that randomly reboots the physical machine, instead of implementing it in your code).

But in the real world almost no project has that kind of budget available.

However, fingers, or teeth or eyes do not regrow because carrying the genetic code to regrow them would be too expensive.

Nearly all of your cells contain the entire genetic code to regrow an entire new person, including new sets of teeth.

Genetic algorithms might be a place where failure could pay off. But you probably wouldn't call it failure then.

You missed the point.

It is true that all our cells contain our DNA, but said DNA can only be used to grow a copy of the organism, as well as to replace the parts which are most likely to fail or wear out in life's daily grind. If you cut yourself you will regrow the skin. Hair, muscle, some parts of the liver, and fingernails will regrow. The heart, eyes, teeth... will not. The genetic code would have to have "subroutines" to replace those, and that would simply be too expensive.

If you write a program that will be used occasionally, there's no point in spending time and money to make it perfect or to worry about memory leaks.

On the other hand, if you write a program that needs to stay on 24/7 you will design some resilience into it but you can't possibly account for all the reasons it would fail. It takes hundreds of millions of dollars to code and test programs that have this kind of complexity. Very few companies are able to afford these budgets.

Urs Hölzle, Chief Engineer of Google's datacenters, has a saying, "At scale, everything breaks." He is absolutely right. Your hardware will fail in one hundred different ways, your bulletproof subsystem will crash, or some bozo will wipe an operating system. In a large, complex enough system, you need to develop fault tolerance for the different parts of the system.

How do you test your fault tolerance solution? By introducing faults! Definitely do this in a test environment first, but you also need to validate your fault tolerance system in the slightly different production environment. In addition to the differences of the two environments, production is the only place where users will complain.

However, of equal importance to injecting faults is actually watching and making sure your fault tolerance system works. If it doesn't, you need to stop injecting that kind of fault. In a large enough system with a good fault tolerance system, the validation can be performed automatically. This is Netflix's Chaos Monkey: testing Netflix's fault tolerance system.

So, in most cases I don't think it's a good idea to randomly terminate your process as it runs. However, if you have a fault tolerance solution, you may want to randomly terminate some of the processes (or disable a VM, or reboot a server, etc.). Before you consider it though, you need a good fault tolerance story and a way to measure how that fault tolerance is performing. Also, you'll need an easy way to turn it off in your environment; it's Murphy's Law that as soon as you inject a fake failure, you'll suffer a real one!
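
As a hedged sketch of those last two requirements, the off switch and the measurement, wrapped around a single injection point (the environment variable and probability are invented for illustration):

    import logging
    import os
    import random

    log = logging.getLogger("fault_injection")

    def maybe_inject_fault(rate=0.001):
        # Only fire when explicitly enabled; the environment variable is the
        # "easy way to turn it off" when a real incident is in progress.
        if os.environ.get("FAULT_INJECTION_ENABLED") != "1":
            return
        if random.random() < rate:
            # Log every injected fault so monitoring can confirm that each one
            # was actually absorbed by the fault-tolerance machinery.
            log.warning("injected fault firing")
            raise RuntimeError("injected fault (deliberate)")

Callers sprinkle maybe_inject_fault() at the points where a real failure could occur, and the monitoring side compares the count of injected faults against the count of successful recoveries.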

Credentials: I work on the fault-tolerance solution of a large, complex SaaS, and our code automatically injects failures to great benefit.

My fear is that folks would end up spending so much time writing code to handle out-there failure scenarios (random terminations deliberately coded into other components just to give downstream code a chance to become more robust) rather than focusing on the specific task their code is supposed to handle.

In Taguchi quality assurance, the idea is to figure out the optimal conditions the product will work in. Build the product to withstand a band of conditions at greater extremes stemming out from the optimal, but spend most of your time fine-tuning the product to work best at the optimal condition. In this way there is some "hardening" of the product, but by no means does it cover every doomsday scenario.

But coding programs to randomly terminate just to fire-drill other code ... great in a test environment, horribly bad in production.

Why would a programmer keep using a library or framework that randomly fails on them for stupid reasons, the worst of all being that the code was designed to self-terminate just to fuck with the developer and see how their code works in "fire drill" mode?

This sounds like someone is trying to push a QA/testing method into the wild as a "value added" proposition. Sort of like "maybe we should give babies handguns, that way the stupid ones kill themselves before they grow old enough to do others harm." Great philosophical debate, but we're not going to start handing out handguns to babies IRL.

There's a difference between injecting failure into good code (i.e. Chaos Monkey sending kill commands) and purposefully writing flawed code (i.e. coding the program to randomly fail without external intervention). The latter is what people are objecting to.