I forgot to mention: I tried setting odls_base_sigkill_timeout as you suggested. Even 5 s was not sufficient for the root to execute its task, and, most important, the kill was instantaneous; there was no 5 s hang. My erroneous conclusion: SIGKILL was being sent instead of SIGTERM.

Which version are you using? Could be a bug in there - I can take a look.

About the man page: at least for me, the word "kill" is not clear. The keywords SIGTERM and SIGKILL would be unambiguous.

Thank you for your prompt reply. I confirmed what you just said by reading the mpirun man page, in the sections Signal Propagation and Process Termination / Signal Handling.

"During the run of an MPI application, if any rank dies abnormally (either exiting before invoking MPI_FINALIZE, or dying as the result of a signal), mpirun will print out an error message and kill the rest of the MPI application."

If I understood correctly, SIGKILL is sent to every process on a premature death.

Each process receives a SIGTERM, and then a SIGKILL if it doesn't exit within a specified time frame. I told you how to adjust that time period in the prior message.

From my point of view, this is a bug. If Open MPI allows handling signals such as SIGTERM, the other processes in the communicator should also have the opportunity to die gracefully. Perhaps I'm missing something?

Yes, you are - you do get a SIGTERM first, but you are required to exit in a timely fashion. You are not allowed to continue running. This is required in order to ensure proper cleanup of the job, per the MPI standard.

Given the behaviour described in the last paragraph, I think it would be great to mention SIGKILL explicitly in the man page, or, even better, to fix the implementation to send only SIGTERM, making it possible for the user to clean up all processes before exit.

We already do, as described above.

I solved my particular problem by adding another flag, unexpected_error_on_slave:
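The code for this workaround is not shown in the archived message; a hypothetical sketch of the flag-based scheme being described (the flag name and the hang come from the surrounding text, everything else is illustrative) might look like:

```
# slave branch, on unexpected error:
send unexpected_error_on_slave to root
hang                                # keep the job alive until the root's
                                    # save/store section has had time to run

# root branch:
if received unexpected_error_on_slave:
    store state                     # the save section that was being skipped
    abort the job
```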

Note that the slave must hang so that the store operation gets executed at the root; otherwise we are back in the previous scenario. In theory it should be unnecessary to send MPI messages to accomplish the desired cleanup, and in more complex applications this can turn into a nightmare. As we know, asynchronous events are insane to debug.

Well, yes and no. When a process abnormally terminates, OMPI will kill the job - this is done by first hitting each process with a SIGTERM, followed shortly thereafter by a SIGKILL. So you do have a short time on each process to attempt cleanup.

My guess is that your signal handler actually is getting called, but we then kill the process before you can detect that it was called.

You might try adjusting the time between sigterm and sigkill using the odls_base_sigkill_timeout MCA param:

mpirun -mca odls_base_sigkill_timeout N

should cause it to wait for N seconds before issuing the SIGKILL. Not sure if that will help or not - it used to work for me, but I haven't tried it in a while. What version of OMPI are you using?

On Mar 22, 2012, at 4:49 PM, Júlio Hoffimann wrote:

Dear all,

I'm trying to handle signals inside an MPI task-farming model. The following is pseudo-code of what I'm trying to achieve:
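The pseudo-code attachment does not appear inline in the archive; a hypothetical reconstruction of the task-farming structure described in this thread (the raise() call and the root-side "save" section are mentioned later, the rest is illustrative) would be:

```
if rank == root:
    install handler for SIGTERM      # handler runs the "save" section so the
    loop:                            #   farm's state can be restarted later
        send task to an idle slave
        receive result from any slave
else:  # slave
    loop:
        receive task from root
        compute
        on unexpected error: raise(SIGTERM)
        send result to root
```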

As can be seen, the signal handling is required to implement a restart feature. The whole problem resides in my assumption that all processes in the communicator will receive a SIGTERM as a side effect. Is that a valid assumption? How does the actual MPI implementation deal with such scenarios?

I also tried replacing all the raise() calls with MPI_Abort(), which, according to the documentation (http://www.open-mpi.org/doc/v1.5/man3/MPI_Abort.3.php), sends a SIGTERM to all associated processes. The undesired behaviour persists: when a slave process is killed, the save section in the root branch is not executed.