I've spent hundreds of hours trying to track down why a process mysteriously terminates at random intervals on 64-bit CentOS 5, and so far I'm no closer to understanding it. We've looked at the OOM killer, looked at every possible log file, done deep postmortems on the server after the event, added debug code to trigger a core dump on any "unusual" termination event, etc.

The process in question starts normally, and will often run on a given server for long periods of time (days, weeks, sometimes longer), but at random intervals on random servers, with no apparent cause-and-effect, it will simply exit. No logs, no core file, no nothing.

I'm not sure what to do next--hoping to get some ideas for troubleshooting this that I haven't thought of.

Can you divulge what the process actually is? The name and version might help. Does it log anything in its own logs or syslog?
– ServerMonkey Jan 7 '15 at 0:01

It's one process from a third-party app; I'd rather not disclose what it is, but I'm not sure that's important. It does not log anything (other processes in this app do, but those logs have proven to be useless). The vendor is just as baffled as we are.
– eric Jan 7 '15 at 4:25

Just to make sure: is "ulimit -c" set to a nonzero value (e.g. "unlimited") in the context where the process runs?
– tonioc Jan 7 '15 at 12:12

Yep; setting ulimit -c is standard for us. But no cores are ever generated by this event. I've thought for a time that something is causing it to terminate "gracefully"; the trick is finding out what.
– eric Jan 7 '15 at 13:41
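One way to confirm the limit in the context of the already-running process, rather than in the shell that launched it, is to read its limits from /proc. This is only a sketch: it assumes the kernel exposes /proc/<pid>/limits (older kernels may not), and the helper name `core_limit_of` is made up for illustration.

```python
import os


def core_limit_of(pid):
    """Return the (soft, hard) core-size limits of `pid` as strings.

    Returns None if /proc/<pid>/limits is missing or unreadable
    (assumption: the running kernel provides this file).
    """
    label = "Max core file size"
    try:
        with open("/proc/%d/limits" % pid) as fh:
            for line in fh:
                if line.startswith(label):
                    # Remaining columns are: soft limit, hard limit, units.
                    fields = line[len(label):].split()
                    return fields[0], fields[1]
    except (IOError, OSError):
        return None
    return None


if __name__ == "__main__":
    # Demonstrated on this script's own PID; substitute the PID of the
    # process that keeps disappearing.
    print(core_limit_of(os.getpid()))
```

A soft limit of 0 here would mean no core file can be written for that process, whatever the login shell's ulimit says.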

1 Answer

Run strace or ltrace on the process. You can capture all the output in a log file, or use the -e flag to filter it down to only what you are interested in. strace shows the system calls the process makes and the signals it receives; ltrace shows its library calls. Either way, you can see what the process was doing at the moment it terminated.
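As a rough illustration, an strace invocation that attaches to a running PID, logs to a file, and leaves out a few noisy system calls could be assembled like this. The PID, log path, and list of calls to skip are placeholders, and `build_strace_argv` is a hypothetical helper; the flags themselves (`-f`, `-tt`, `-o`, `-p`, `-e trace=!...`) are standard strace options.

```python
def build_strace_argv(pid, skip_syscalls, log_path):
    """Build an strace command line that attaches to `pid`.

    -f   follow child processes created after attaching
    -tt  prefix each line with a wall-clock time including microseconds
    -o   write the trace to `log_path` instead of stderr
    -p   attach to an already-running process
    -e trace=!a,b  trace every system call except a and b
    """
    argv = ["strace", "-f", "-tt", "-o", log_path, "-p", str(pid)]
    if skip_syscalls:
        argv += ["-e", "trace=!" + ",".join(skip_syscalls)]
    return argv


if __name__ == "__main__":
    # Placeholder values, not taken from the system discussed here.
    argv = build_strace_argv(12345, ["read", "write"], "/tmp/strace.log")
    print(" ".join(argv))
```

When typing the command into an interactive shell instead, the `!` usually has to be quoted or escaped so the shell does not treat it as history expansion.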

strace especially looks very interesting--thanks for suggesting it. My only immediate problem is that the output is too verbose. I see that there are ways to filter down to specific "categories" of call, but I need to omit specific calls; not sure how to do that yet.
– eric Jan 7 '15 at 5:00

Aha! I figured out the subtleties of the -e flag; running now with all the "noise" eliminated. Now we wait...
– eric Jan 7 '15 at 5:29

I'm not sure if this is the answer yet, but after several hours, we got one of the random exits, and strace shows a segmentation fault (SIGSEGV) just before the process exited. We are investigating further now, but that's the best clue we've gotten in a long time on this issue. I'm puzzled as to why this wouldn't trigger a core dump.
– eric Jan 7 '15 at 21:32

Verify Linux is set up for core dumps. Can you write a small test program and make sure it dumps core on segfault?
– Michael Martinez Jan 7 '15 at 22:35
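A small test along those lines might look like the sketch below. It forks a child, raises the child's soft core limit to its hard limit, and has the child send itself SIGSEGV. That simulates a crash rather than performing a real invalid memory access, but the kernel's default action for the signal is the same. The parent then reports whether the kernel flagged a core dump. The function name and the use of a temporary directory as the working directory are illustrative choices.

```python
import os
import resource
import signal
import tempfile


def crash_child_and_report(workdir):
    """Fork a child that dies from SIGSEGV and report how it ended."""
    pid = os.fork()
    if pid == 0:
        try:
            # Core files normally land in the current working directory.
            os.chdir(workdir)
            # Raise the soft core limit as far as the hard limit allows.
            _, hard = resource.getrlimit(resource.RLIMIT_CORE)
            resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
            # Make sure the default action (terminate, dump core) applies.
            signal.signal(signal.SIGSEGV, signal.SIG_DFL)
            os.kill(os.getpid(), signal.SIGSEGV)
        finally:
            # Only reached if the child somehow survived the signal.
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    signaled = os.WIFSIGNALED(status)
    return {
        "signal": os.WTERMSIG(status) if signaled else None,
        "core_dumped": bool(signaled and os.WCOREDUMP(status)),
    }


if __name__ == "__main__":
    print(crash_child_and_report(tempfile.mkdtemp()))
```

If this reports the signal but `core_dumped` is false, the system-wide settings (core limits, core pattern, directory permissions) are worth checking before looking at the application itself. If it does dump core, the application may be catching SIGSEGV itself and exiting on its own.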