Conversation

This PR was originally in #1013 but wasn't quite on topic there, hence was broken out.

After the reactor exits, the broker sets up a SIGALRM handler in case its last ditch attempts to unload modules hangs. This signal handler calls exit() which causes any registered atexit() functions to run, including one to clean up temporary files and zsys_shutdown() to close dangling zeromq sockets. This latter one is installed automatically through the use of various libczmq classes.

zsys_shutdown() has an internal mutex that would appear to only protect against concurrent calls to itself, not against calls to other socket operations. Since zeromq sockets are not thread safe, this can lead to segfaults. This became evident in PR #1013 when converting from zsocket to zsock classes.

To avoid calling this function from the SIGALARM handler where comms modules may still be shutting down, replace the call to exit() with _exit(). We do however still want to clean up temporary files, so expose cleanup_run() and call it manually from the signal handler.

The unlink_recursive() function called by cleanup_run() is generally useful and has no tests, so break it out to its own libutil source module and add a test. Fix a bug discovered by the test.

This comment has been minimized.

After you rebase this one on master, I want to doublecheck coverage. The current report strangely shows some unrelated functions are now missed (e.g. in aggregator and cron), which can sometimes mean testsuite wasn't fully run, or perhaps just a transient problem with codecov.

This comment has been minimized.

Looking through the coverage travis run, I do see a fair number of warnings that the broker exited through the SIGALRM timeout. Hmm, if gcov collects its results using an atexit() function, then maybe it fails to collect results when the SIGALRM handler calls _exit()?

This comment has been minimized.

Looking through the coverage travis run, I do see a fair number of warnings that the broker exited through the SIGALRM timeout. Hmm, if gcov collects its results using an atexit() function, then maybe it fails to collect results when the SIGALRM handler calls _exit()?

That is a very good theory... Hm, I don't have any good ideas at the moment.

This comment has been minimized.

Yeah, I poked at calling __gcov_dump() in the signal handler, but the function isn't defined if not building with gcov and I don't see a preprocessor symbol that would allow it to be easily compiled conditionally. I guess I could add one.

I think in addition to that, I need to figure out how to reduce the SIGALRM timeouts, since that indicates that (probably) the module unloading in rc3 was cut short because the shutdown grace is too tight for the code when it's running under instrumentation. That also could affect coverage and since it's a timing thing, might contribute to coverage instability. The kludgey way to work around this would be to conditionally increase the timeouts, but a much better way would be to fix #1006...

Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.