A bit earlier, I was helping Yurii Rashkovskii (yrashk) with his code for Agner (https://github.com/agner/agner) and he tripped on a pretty interesting corner case. I'm writing this both to document what we found (and how we solved it) and to ask a few questions about why things are this way (skip to the end if you don't want the explanation).
His application is structured as follows: the application starts a top-level supervisor, which starts a server and another supervisor, which in turn spawns the dynamic children (hoping this looks right in the email):
App ---> TopSup ---> Sup ---> [SimpleOneForOneWorkers]
           |
       SomeWorker
[url just in case: http://ideone.com/pr0vc]
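For readers who prefer code to ASCII art, here is a minimal sketch of a tree with that shape. This is not Agner's actual code: the module name demo_sup, the registered names top_sup and sofo_sup, and the worker modules some_worker and sofo_worker are all made up for illustration, and I'm assuming map-based child specs (OTP 18 or later).

```erlang
-module(demo_sup).
-behaviour(supervisor).
-export([start_link/0, start_sofo_sup/0, init/1]).

%% TopSup: would be started from the application callback module.
start_link() ->
    supervisor:start_link({local, top_sup}, ?MODULE, top).

%% Sup: started by TopSup as one of its regular children.
start_sofo_sup() ->
    supervisor:start_link({local, sofo_sup}, ?MODULE, sofo).

%% TopSup has two static children: a plain server and the SOFO supervisor.
init(top) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Children = [
        #{id => some_worker,                      % hypothetical server
          start => {some_worker, start_link, []},
          type => worker},
        #{id => sofo_sup,
          start => {?MODULE, start_sofo_sup, []},
          shutdown => infinity,
          type => supervisor}
    ],
    {ok, {SupFlags, Children}};
%% Sup only holds a template; the workers are added at run time with
%% supervisor:start_child(sofo_sup, []).
init(sofo) ->
    SupFlags = #{strategy => simple_one_for_one, intensity => 5, period => 10},
    Children = [
        #{id => sofo_worker,                      % hypothetical dynamic worker
          start => {sofo_worker, start_link, []},
          restart => temporary,
          shutdown => 5000,
          type => worker}
    ],
    {ok, {SupFlags, Children}}.
```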
Now the thing is that the documentation says the following: "Important note on simple-one-for-one supervisors: The dynamically created child processes of a simple-one-for-one supervisor are not explicitly killed, regardless of shutdown strategy, but are expected to terminate when the supervisor does (that is, when an exit signal from the parent process is received)."
And indeed they are not. The supervisor just kills its regular children and then disappears, leaving it to the simple-one-for-one children's behaviours to catch the exit message and leave. This alone is fine.
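To make "catch the exit message and leave" concrete, here is a rough sketch of what such a dynamic worker could look like as a gen_server. The module name sofo_worker and the empty state are placeholders of mine, not something taken from Agner. The only part that matters is the trap_exit flag: with it set, the exit signal from the dying parent is handled by the gen_server loop, which then calls terminate/2.

```erlang
-module(sofo_worker).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    %% Without this flag, the parent's exit signal would kill the process
    %% outright and terminate/2 would never run.
    process_flag(trap_exit, true),
    {ok, #{}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Info, State) ->
    {noreply, State}.

%% Called when the parent (the supervisor) exits, among other cases.
%% Whatever cleanup the worker needs would go here.
terminate(_Reason, _State) ->
    ok.
```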
Next thing we have is the application itself. For each application, OTP spawns an application master (started by the application controller; I'll loosely refer to it as the AC in this email) which acts as a group leader. As a reminder, the AC is linked both to its parent and its direct child and monitors them. When any of them fails, the AC terminates its own execution, using its status as a group leader to terminate all of the leftover children. Again, this alone is fine.
However, if you mix in both features, and then decide to shut the application down with 'application:stop(agner)', you end up in a very troublesome situation:
App --> TopSup (dead) --> Sup (dead) --> [SimpleOneForOneWorkers]
          |
   SomeWorker (dead)
[url just in case: http://ideone.com/KklZ8]
At this precise point in time, both supervisors are dead, as well as the regular worker in the app. The simple-one-for-one (SOFO) workers are currently dying, each catching the 'EXIT' signal sent by their direct ancestor.
At the same time, though, the AC gets wind of its direct child dying (TopSup) and ends up killing every one of the SOFO workers that isn't dead yet.
The result is a bunch of workers which managed to clean up after themselves, and a bunch of others that didn't. This is highly timing dependent and hard to debug, but easy to fix.
Yurii and I basically found two fixes for that one. The first one is to simply make the SOFO workers transient and kill them beforehand (which is messy). The second one is to use the 'ApplicationCallback:prep_stop(State)' function to fetch a list of all the dynamic SOFO children, monitor them, and then wait for all of them to die in the 'stop(State)' callback function. This forces the AC to stay alive until all of the dynamic children have died.
You can see the actual implementation here: https://github.com/agner/agner/blob/db2527cfa2133a0679d8da999d1c37567151b49f/src/agner_app.erl
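For those who don't want to click through, the second fix boils down to something like the sketch below. It is a simplified illustration and not a copy of agner_app.erl: the names demo_app, demo_sup and sofo_sup are hypothetical, and it assumes the SOFO supervisor is registered locally as sofo_sup. It also relies on prep_stop/1 and stop/1 running in the same process, so that the 'DOWN' messages for monitors set in the former can be received in the latter.

```erlang
-module(demo_app).
-behaviour(application).
-export([start/2, prep_stop/1, stop/1]).

start(_Type, _Args) ->
    demo_sup:start_link().   % hypothetical top-level supervisor module

%% Runs before the supervision tree is shut down, so the dynamic children
%% are still alive. Monitor each of them and hand the references to stop/1
%% through the return value.
prep_stop(_State) ->
    Children = supervisor:which_children(sofo_sup),
    [erlang:monitor(process, Pid)
     || {_Id, Pid, _Type, _Mods} <- Children, is_pid(Pid)].

%% Runs after the supervision tree is gone. Block until every monitored
%% child is down. There is no timeout here, so a worker that never exits
%% would hang the shutdown.
stop(Refs) ->
    wait_for_down(Refs).

wait_for_down([]) ->
    ok;
wait_for_down([Ref | Rest]) ->
    receive
        {'DOWN', Ref, process, _Pid, _Reason} ->
            wait_for_down(Rest)
    end.
```

Monitoring a process that is already dead yields an immediate 'DOWN' message, so a child that exits between the two callbacks doesn't cause trouble. Adding an 'after' clause to the receive would be a reasonable safety net.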
--
Now for the questions:
- Is there any reason for simple_one_for_one supervisors to use this kind of asynchronous scheme when it comes to terminating? We wouldn't have had any bug at all if it were possible to call for a synchronous termination. Is this due to any legacy reason or because of general performance for larger SOFO supervisors?
- Is there any way someone can think of to document this kind of error? We realise this really is a corner case, but it's a known one (at least now it is) and it could be useful to read about it somewhere without having to make the big logical leap between the termination behaviour of the application controller (for which we needed to dive into the source) and that of supervisors of simple_one_for_one workers.
- Can any of you think of a cleaner way to shut things down than what we used?
Thanks for reading,
--
Fred Hébert
http://www.erlang-solutions.com