License to SIGKILL

Every program wants to live forever. What happens when a program is forced to exit before it’s done running, and why would we want to do that?

Unix Signals

Feel free to skip if you are familiar with signals.

In Unix, processes can communicate with each other using pre-defined signals. You can see a list of Unix signals here. This ability to communicate is extremely important in a process-oriented program. For example, the Puma webserver can add concurrency by spawning child “worker” processes. It accepts requests into a master process and then hands them off to the next available child. If the system that is running the Puma master process needs to shut down or restart, we don’t want all in-flight requests to be stopped in their tracks. Instead, we want each child worker to finish processing its request if it can, clean up any external connections or temporary files it may have generated, and then exit. The system can do this safely by sending a signal to the parent “master” process, which in turn forwards it to the child processes.

You may have seen the movie Tron Legacy. The movie opens with a hacker breaking into a corporate network. The CEO sees it happening and deftly responds by typing a $ kill -9 command into the terminal. This kill command in Linux (and Mac OS X) sends signal number 9, which is SIGKILL, to a process. SIGKILL means “end now without cleanup”. This is similar to force-quitting a program through CTRL+ALT+DELETE on Windows (though Windows is not POSIX compliant and doesn’t support Unix signals in the same way).

When we need our long running processes to exit gracefully, the signal SIGKILL is too strong. That signal forces processes to exit immediately and can leave your system in a bad state. What should you use instead? The signal SIGTERM (signal number 15) is the “termination signal”. This tells a program that it needs to stop what it’s doing and clean up before exiting.
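
To double-check those numbers, Ruby exposes the platform's signal table through Signal.list. This is a small sketch, assuming a Linux or macOS machine where TERM and KILL carry their usual numbers:

```ruby
# Look up signal numbers by name from Ruby's built-in table.
term_number = Signal.list["TERM"]
kill_number = Signal.list["KILL"]

puts "SIGTERM is signal #{term_number}"
puts "SIGKILL is signal #{kill_number}"

# Signal.signame goes the other way, from number to name.
puts "Signal 15 is #{Signal.signame(15)}"
```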

Live and Let Die

When Ruby receives a SIGTERM signal it raises a SignalException error. In Ruby, an exception can happen at any point while a process is running, so critical clean up code should always use an ensure block.

begin
  # do something
ensure
  # clean up something
end

There are notable caveats here. For example, an exception can be raised while an ensure block is already running, so we can’t always rely on it to run to completion. For an in-depth look into errors in Ruby, I recommend Avdi’s Exceptional Ruby. That being said, it’s still a best practice to use ensure blocks to safeguard your code.
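
To make the pattern concrete, here is a minimal sketch of cleanup living in an ensure clause. The helper name with_scratch_file and the file name are made up for illustration; they are not from any library.

```ruby
require "tmpdir"

# Hypothetical helper: creates a scratch file, yields its path,
# and always removes the file afterwards.
def with_scratch_file(path)
  File.write(path, "")
  yield path
ensure
  # Runs whether the block returned normally or raised.
  File.delete(path) if File.exist?(path)
end

scratch = File.join(Dir.tmpdir, "scratch-#{Process.pid}.tmp")
begin
  with_scratch_file(scratch) { |_path| raise "something went wrong" }
rescue RuntimeError => e
  puts "rescued: #{e.message}"
end
puts "scratch file still exists? #{File.exist?(scratch)}"
```

The block raises, yet the scratch file is gone afterwards because the ensure clause ran on the way out.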

Since we already have this failsafe behavior, Ruby uses it when a SignalException is raised. To verify this we can write a trivial script:
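
The original script is not reproduced here, so what follows is a sketch of one way to run the experiment, not the author's code. It forks a child, waits until the child is inside its begin block, sends SIGTERM, and then reports what happened. The helper name term_child_with_ensure and the pipe-based handshake are assumptions made for the example, and fork requires a Unix-like system.

```ruby
# Hypothetical helper: forks a child, sends it SIGTERM, and returns
# what the child printed plus its Process::Status.
def term_child_with_ensure
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    writer.sync = true
    begin
      writer.puts "child ready"
      sleep # wait here until a signal arrives
    ensure
      writer.puts "ensure called"
    end
  end

  writer.close
  reader.gets                 # block until the child is inside begin
  Process.kill("TERM", pid)
  _, status = Process.wait2(pid)
  output = reader.read
  reader.close
  [output, status]
end

output, status = term_child_with_ensure
puts output
puts "terminated by signal #{status.termsig.inspect}"
```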

You’ll notice that, in addition to “ensure called”, we also get the number of the signal that was used to exit the process (15, which corresponds to SIGTERM). Neat. This behavior is really convenient, since any program that has error handling is already equipped to exit gracefully. By putting sensitive operations in an ensure block, we make it more likely that the program will do the right thing: the ensure blocks get called, and then the program exits. Note that if you re-run the program with SIGKILL instead, it reports a different signal number (9) and we don’t get any output from the ensure block.

Tomorrow Never Dies

I’m sure you’ve had a frustrating app on your computer that was frozen and wouldn’t die no matter how many times you clicked the “close” button. Some stubborn programs will never exit, no matter how many times you send SIGTERM to them. This can happen when the program gets stuck trying to clean itself up. We can reproduce this easily:
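
The reproduction script is missing from this copy, so here is a hedged sketch of the idea: a child whose ensure block never finishes. The helper name term_child_stuck_in_ensure is hypothetical. The parent sends a single SIGTERM, confirms the child is still alive, and finally falls back to SIGKILL so the example itself can end.

```ruby
# Hypothetical helper: the child's "cleanup" loops forever, so one
# SIGTERM is not enough to stop it.
def term_child_stuck_in_ensure
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    writer.sync = true
    begin
      writer.puts "child ready"
      sleep
    ensure
      # "Cleanup" that never finishes.
      loop do
        writer.puts "ensure called"
        sleep 0.1
      end
    end
  end

  writer.close
  reader.gets
  Process.kill("TERM", pid)
  lines = Array.new(3) { reader.gets } # wait for three lines of output
  still_running = Process.wait(pid, Process::WNOHANG).nil?
  Process.kill("KILL", pid)            # the only way out
  _, status = Process.wait2(pid)
  reader.close
  [lines, still_running, status]
end

lines, still_running, status = term_child_stuck_in_ensure
puts lines
puts "still running after SIGTERM? #{still_running}"
puts "finally terminated by signal #{status.termsig}"
```

Left alone instead of being killed, a process like this keeps printing the same line: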

ensure called
ensure called
ensure called
ensure called
ensure called
# ...
ensure called

It will never end until the machine is restarted or SIGKILL is sent. Instead of this trivial example, it’s easy to imagine your Ruby program waiting on a database query or network call to finish. If the call hangs and your program never gets a response, it will never exit. That’s why it’s critical to put a timeout on sensitive code, though be careful with timeout.rb: it works by raising an exception in the running code at an arbitrary point, which brings its own cleanup problems.
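
One way to bound a wait without timeout.rb is to let the operating system do the waiting. The sketch below uses IO.select with a timeout. read_with_deadline is a hypothetical helper name, and a pipe stands in for a real network socket.

```ruby
# Hypothetical helper: returns data from io, or nil if nothing
# arrives within the given number of seconds.
def read_with_deadline(io, seconds)
  # IO.select returns nil when the timeout expires with nothing readable.
  return nil unless IO.select([io], nil, nil, seconds)

  chunk = io.read_nonblock(4096, exception: false)
  # :wait_readable or nil (end of file) both mean "no data".
  chunk.is_a?(String) ? chunk : nil
end

reader, writer = IO.pipe
p read_with_deadline(reader, 0.2) # nothing has been written yet
writer.write("pong")
p read_with_deadline(reader, 0.2)
```

Many client libraries expose their own timeout settings, and those are usually preferable to hand-rolled waiting when they exist.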

It’s important to note that all ensure blocks in scope will be called when a SignalException is raised. This means that, in addition to the ensure blocks in your own code, those in any of your dependencies will run too. If your system is hanging on exit and you can’t find an errant ensure block in code that you’ve committed, it may be in a library you’re using.

Say (Dr.) No to Signal Trapping

Another way that you can prevent a program from exiting is to use Signal.trap. When you run this code, the signal will get captured and the program will not exit.
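
The trapping code did not survive in this copy, so this is a sketch of what such a program could look like, not the original. The helper name term_child_with_trap is hypothetical. The child installs a trap for TERM, and the parent checks that the child is still alive after the signal, then cleans up with SIGKILL.

```ruby
# Hypothetical helper: the child traps SIGTERM, prints a message,
# and carries on running.
def term_child_with_trap
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    writer.sync = true
    # Replace Ruby's default SIGTERM behavior with our own block.
    Signal.trap("TERM") { writer.puts "Die Another Day" }
    writer.puts "child ready"
    loop { sleep 0.1 } # keep "working"
  end

  writer.close
  reader.gets
  Process.kill("TERM", pid)
  line = reader.gets
  still_running = Process.wait(pid, Process::WNOHANG).nil?
  Process.kill("KILL", pid) # clean up the demo child
  _, status = Process.wait2(pid)
  reader.close
  [line, still_running, status]
end

line, still_running, _status = term_child_with_trap
puts line
puts "still running after SIGTERM? #{still_running}"
```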

When you execute the program, you get the output "Die Another Day" but it continues to execute. It is possible to trap and re-raise the same signal; however, this is a very large hammer. A program can also stop without ever receiving a signal, in which case the trap block never runs, so we can’t rely on cleanup code placed there being executed. Worse yet, when we do get a signal, the system needs us to clean up and exit as quickly as possible. The best practice is to use ensure blocks whenever possible and only resort to signal trapping when it’s really necessary.
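
For completeness, here is a sketch of the trap-and-re-raise approach mentioned above. It assumes that raising a SignalException from inside the trap block is an acceptable way to hand control back to the normal exception path. The helper name is hypothetical.

```ruby
# Hypothetical helper: the child traps SIGTERM, does a small amount
# of work, then re-raises so ensure blocks still run and it exits.
def term_child_with_trap_and_reraise
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    writer.sync = true
    Signal.trap("TERM") do
      writer.puts "Die Another Day"
      # Hand control back to the normal exception path.
      raise SignalException, "SIGTERM"
    end
    begin
      writer.puts "child ready"
      loop { sleep 0.1 }
    ensure
      writer.puts "ensure called"
    end
  end

  writer.close
  reader.gets
  Process.kill("TERM", pid)
  _, status = Process.wait2(pid)
  output = reader.read
  reader.close
  [output, status]
end

output, status = term_child_with_trap_and_reraise
puts output
puts "terminated by signal #{status.termsig.inspect}"
```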

From Russia with Love and Signals

So far, we’ve looked at how your Ruby code handles signals, but how would you know what signals to send? Before any restart or shutdown, you would send a SIGTERM to the process to let it clean up, then monitor it to see if it shuts down in a reasonable time frame. If it doesn’t, you would send a SIGKILL to force the process to stop, ending any infinite ensure blocks. It’s worth making a note whenever a process does not exit after a SIGTERM, because force-killing it may interrupt some important work or cleanup. The company I work for, Heroku, goes through these steps every time you deploy or restart your application. If, for some reason, your application won’t exit on time, the system emits an R12 – Exit Timeout and records the error on your dashboard view so you can investigate later.
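
The TERM-then-KILL sequence can be sketched in a few lines. This is an illustration only, not how Heroku or any particular process manager implements it. stop_gracefully, its keyword arguments, and its return values are hypothetical names, and it assumes the target is a child of the calling process so that Process.wait can reap it.

```ruby
# Hypothetical helper: ask a child process to stop, and force it
# to stop if it has not exited within the grace period.
def stop_gracefully(pid, grace_seconds: 5, poll_interval: 0.1)
  Process.kill("TERM", pid)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + grace_seconds

  while Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
    # WNOHANG returns the pid once the child has exited, nil otherwise.
    return :exited_on_term if Process.wait(pid, Process::WNOHANG)
    sleep poll_interval
  end

  # Grace period is over: this is the case worth logging.
  Process.kill("KILL", pid)
  Process.wait(pid)
  :killed_after_timeout
end

child = fork { sleep } # a well-behaved child that exits on SIGTERM
puts stop_gracefully(child, grace_seconds: 2)
```

Returning a symbol makes it easy for the caller to record the forced-kill case, which is the one that deserves investigation.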

While it’s difficult to conceptualize, an exception might stop your entire program at any time, so it’s nice to know that adding ensure blocks to the places that should already have them is all you need to do to be safe. Whether you’re working for Her Majesty’s Secret Service or in an IT department at a Casino Royale, you can take a Quantum of Solace knowing that your programs can exit gracefully.