Yakaz engineers share their tech knowledge and experience.

09/15/2011

At Yakaz, Erlang is of fundamental importance. We use it mainly for our cluster of web servers and our chat service, among other things. Many OTP applications share the same Erlang VM, with (sometimes complex) dependencies between them.

As much as possible, we try to do hot upgrades of these applications. But sometimes we need to shut down a node. We do this following the OTP design principles, and it must be safe: no data lost, no error raised.

For example, when a node is stopped, our HTTP service must close its listening sockets so that it accepts no new requests, close all waiting TCP connections, and reply to all HTTP requests in progress.

During our work, we found some corner cases with OTP supervisors. In this article, we explain these cases and the solutions we found to work around them.

The OTP design principles in general, and the supervisor behaviour in particular, are explained in the OTP Design Principles User's Guide. Readers should be familiar with these notions in order to fully understand this article.

Infinite timeout to shut down worker processes

The first problem we encountered concerns a limitation of the supervisor behaviour. Children attached to a supervisor are themselves either supervisors or workers. For each of them, a Shutdown strategy is defined. It is part of the child specification and defines how the child should be terminated. The documentation says:

An integer timeout value means that the supervisor tells the child process to terminate by calling exit(Child, shutdown) and then waits for an exit signal back. If no exit signal is received within the specified time, the child process is unconditionally terminated using exit(Child, kill).

If the child process is another supervisor, it should be set to infinity to give the subtree enough time to shutdown.

The documentation is not very explicit about it, but the Shutdown value can (and should) be set to infinity for supervisor children only. It is forbidden for worker children. This is a harsh limitation.
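As an illustration, the two documented forms of the Shutdown value can be sketched as child specifications. This is a minimal sketch only: the module name shutdown_specs_sketch and the child names my_worker and my_sub_sup are hypothetical and do not come from our code base.

```erlang
-module(shutdown_specs_sketch).
-export([worker_spec/0, supervisor_spec/0]).

%% Worker child: the supervisor calls exit(Child, shutdown), waits up to
%% 5000 ms for the exit signal, then falls back to exit(Child, kill).
worker_spec() ->
    {my_worker,                        % Id (hypothetical)
     {my_worker, start_link, []},      % {Module, Function, Args}
     permanent,                        % Restart strategy
     5000,                             % Shutdown: timeout in milliseconds
     worker,                           % Type
     [my_worker]}.                     % Modules

%% Supervisor child: infinity gives the whole subtree the time it needs.
supervisor_spec() ->
    {my_sub_sup,
     {my_sub_sup, start_link, []},
     permanent,
     infinity,                         % Shutdown: wait without upper bound
     supervisor,
     [my_sub_sup]}.
```

The limitation discussed here is that, in the OTP release we run, the infinity value of the second specification is rejected as soon as the Type field is worker.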

To do some cleanup when a worker process is stopped, we must define an upper bound on the time this cleanup may take. When this is possible, it is the best solution and all is fine. But when it is not, because we cannot use an infinite timeout to shut down a worker process, we must find a workaround.

Given the circumstances, there are three ways to solve this problem:

The good: Stop the concerned workers by hand. With this method, it is easy to shut down these processes properly. For an OTP application, this can be done in the prep_stop function of the application callback module. But there are two drawbacks. Firstly, the Restart strategy of the processes in question must not be set to permanent[1]. Secondly, the responsibility for stopping these processes falls to the developer (which is messy) and no longer to the supervisor.

The bad: Set a very high timeout for want of anything better. This is not elegant and it tastes of defeat. But it works almost every time.

The ugly: Declare the concerned processes as supervisors instead of workers. There is no side effect (as far as we know) and we can set an infinite timeout. But there is no guarantee that this will always be so. This is a crafty (and, honestly, ugly) way to solve our problem, but it serves the purpose.
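The first option, stopping a worker by hand in prep_stop, could look roughly like the sketch below. It makes several assumptions: the worker is a gen_server registered under the hypothetical name my_worker, its Restart strategy is transient or temporary, my_top_sup is a hypothetical top supervisor, and gen_server:stop/3 is available (OTP 18 or later; on older releases a synchronous gen_server:call with an infinity timeout would play the same role).

```erlang
-module(my_app_sketch).
-behaviour(application).
-export([start/2, prep_stop/1, stop/1]).
-export([stop_worker/1]).

start(_Type, _Args) ->
    my_top_sup:start_link().           % hypothetical top supervisor

%% Called before the supervision tree is shut down: stop the worker
%% ourselves, so that its cleanup is not bounded by a Shutdown timeout.
prep_stop(State) ->
    ok = stop_worker(my_worker),       % my_worker: hypothetical registered name
    State.

stop(_State) ->
    ok.

%% Ask the gen_server to terminate with reason shutdown and wait, with no
%% upper bound, until its terminate/2 callback has returned and the process
%% is gone. A worker that is not running counts as already stopped.
stop_worker(Name) ->
    try
        gen_server:stop(Name, shutdown, infinity)
    catch
        exit:noproc -> ok
    end.
```

Because the worker exits with reason shutdown, a supervisor with a transient Restart strategy for it will not try to restart it, which is the constraint described in [1].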

As we have just seen, there is no proper and general solution to this problem. There is no evident reason for this limitation. The cleanest solution we found at Yakaz was to patch the supervisor module to remove this restriction. You can find our patch on GitHub (Diff view). We submitted it to the Erlang/OTP team, hoping it will be accepted.

Shutdown dynamic children for simple_one_for_one supervisor

Another, trickier problem with supervisors concerns simple_one_for_one supervisors.

simple_one_for_one supervisors are used to manage dynamically instantiated child processes. All these children share the same child specification. This is handy for implementing connection handlers: every time a new connection is accepted, we can start a new child to manage it.
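A minimal sketch of such a supervisor is shown below. The module names conn_sup_sketch and conn_handler are hypothetical, and conn_handler:start_link/1 is assumed to take the accepted socket as its only argument.

```erlang
-module(conn_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, start_handler/1, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Called every time a connection is accepted. The extra argument list is
%% appended to the arguments of the shared child specification, so this
%% ends up calling conn_handler:start_link(Socket).
start_handler(Socket) ->
    supervisor:start_child(?MODULE, [Socket]).

init([]) ->
    %% The single child specification shared by all dynamic children.
    ChildSpec = {conn_handler,                       % Id (unused for dynamic children)
                 {conn_handler, start_link, []},     % hypothetical handler module
                 temporary,                          % never restart a dead handler
                 5000,                               % Shutdown timeout in milliseconds
                 worker,
                 [conn_handler]},
    {ok, {{simple_one_for_one, 5, 10}, [ChildSpec]}}.
```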

But there is a subtle corner case with this type of supervisor: dynamic child processes are not explicitly killed by the supervisor when it is shut down.

The official documentation says:

Important note on simple-one-for-one supervisors: The dynamically created child processes of a simple-one-for-one supervisor are not explicitly killed, regardless of shutdown strategy, but are expected to terminate when the supervisor does (that is, when an exit signal from the parent process is received).

Because child processes are linked (in the Erlang sense of the word) to their supervisor, when the supervisor dies, the dynamic child processes receive an exit signal from it and terminate. All is fine as long as we stop the simple_one_for_one supervisor manually. But if this happens while we stop an application, then once the top supervisor has stopped, the application master kills all remaining processes associated with this application[2], including leftover dynamic children. So dynamic children that trap exit signals can be killed in the middle of their cleanup. This is unpredictable and highly timing-dependent.

Let's explain this behaviour in detail with an example. Here is our supervision tree:

App ---> TopSup ---> Sup ---> [SimpleOneForOneWorkers]
           |
           +---> SomeWorker

Suppose that:

[SimpleOneForOneWorkers] are implemented using the gen_server behaviour and they trap exit signals

No brutal_kill shutdown strategy is used

If Sup is shut down by hand by calling supervisor:terminate_child(TopSup, Sup), TopSup will tell Sup to terminate by calling exit(Sup, shutdown). Once Sup is dead, all [SimpleOneForOneWorkers] receive an 'EXIT' message from it. Because they trap exit signals, their Worker:terminate/2 function is called with Reason=shutdown and, in turn, they die. This is the good case.
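The kind of worker assumed in this example can be sketched as follows. The module name conn_worker_sketch and the NotifyPid argument are hypothetical: the notification message only stands in for real cleanup work and makes the call to terminate/2 observable.

```erlang
-module(conn_worker_sketch).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% NotifyPid is a hypothetical observer used only to make cleanup visible.
start_link(NotifyPid) ->
    gen_server:start_link(?MODULE, NotifyPid, []).

init(NotifyPid) ->
    %% Trap exits: an exit signal from the parent then becomes an 'EXIT'
    %% message, and gen_server calls terminate/2 instead of dying at once.
    process_flag(trap_exit, true),
    {ok, NotifyPid}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.

%% Runs when the parent (normally the supervisor) exits or asks us to stop.
%% Real cleanup (flushing data, replying to pending requests) would go here.
terminate(Reason, NotifyPid) ->
    NotifyPid ! {cleaned_up, self(), Reason},
    ok.

code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

If the process is killed while terminate/2 is still running, whatever cleanup remains is simply lost, which is the bad case described next.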

However, if instead of shutting down Sup by hand we stop the application with application:stop(App), this will stop TopSup. During its termination, TopSup stops SomeWorker and Sup, then dies. At this point, TopSup, SomeWorker and Sup are fully stopped and the [SimpleOneForOneWorkers] are stopping (some may already be stopped, some not). The last step is the application master killing all workers that are not dead yet, in the middle of their cleanup.

This behaviour is very troublesome and hard to debug. An easy way to fix this problem can be found in Agner. It uses the ApplicationCallback:prep_stop(State) function to fetch a list of all the simple_one_for_one workers and monitor them, and then waits for all of them to die in the ApplicationCallback:stop(State) function. This forces the application master to stay alive until all of the dynamic children have died.
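The idea can be sketched as follows. This is not Agner's actual code: the names my_dyn_app_sketch, my_top_sup and my_dynamic_sup are hypothetical, and the sketch relies on prep_stop/1 and stop/1 being run by the same process, so that the 'DOWN' messages of the monitors set in prep_stop/1 arrive where stop/1 waits for them.

```erlang
-module(my_dyn_app_sketch).
-behaviour(application).
-export([start/2, prep_stop/1, stop/1]).
-export([monitor_all/1, wait_for_down/1]).

start(_Type, _Args) ->
    my_top_sup:start_link().           % hypothetical top supervisor

%% Runs while the supervision tree is still alive: monitor every dynamic
%% child and hand the monitor references over to stop/1 as the new state.
prep_stop(_State) ->
    Children = supervisor:which_children(my_dynamic_sup),  % hypothetical name
    Pids = [Pid || {_Id, Pid, _Type, _Mods} <- Children, is_pid(Pid)],
    monitor_all(Pids).

%% Runs after the top supervisor has stopped: block until every monitored
%% child is gone, so that none of them is killed during its cleanup.
stop(Refs) ->
    wait_for_down(Refs).

monitor_all(Pids) ->
    [erlang:monitor(process, Pid) || Pid <- Pids].

wait_for_down([]) ->
    ok;
wait_for_down([Ref | Rest]) ->
    receive
        {'DOWN', Ref, process, _Pid, _Reason} -> wait_for_down(Rest)
    end.
```

The wait has no timeout on purpose: a child that never terminates will block the application stop, so a bounded variant may be preferable in some systems.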

This solution is elegant and has no technical drawback. But it must be done for every simple_one_for_one supervisor. This is painful and error-prone. Nevertheless, we can live with that.

The main problem here is that this behaviour breaks the promises made by the supervisor behaviour. Its purpose is to start, stop and monitor its child processes, restarting them when necessary. There is no reason to treat dynamic child processes differently from other child processes. It can be seen as a bug or a lack of consistency. So, again, we decided to patch the supervisor module. You can find our patch on GitHub (Diff view). We submitted it to the Erlang/OTP team, hoping it will be accepted.

[1] Because the processes are stopped manually, outside the supervisor's control, the supervisor will try to restart them if their Restart strategy is set to permanent.
[2] i.e. all processes with the application master as group leader