I was called to look at a server today after a customer reported that connecting over SSH was slow and that executing commands was also slow (with some not working at all).

After logging in I could type promptly, so I didn't think it was a network issue such as latency or bandwidth saturation (in my experience those tend to show up directly in the interactive SSH session). I first tried to run top; after a minute of nothing happening I cancelled it with CTRL+C. The prompt had just been hanging, waiting for top to start up.

free -m also just hung at the prompt for a minute or longer before I cancelled it.

df -h did execute, and showed that 60% of disk space was free (I was wondering if some application had gone bananas and filled up the disks with logs).

dmesg wouldn't execute either.

I executed tail -n 50 /var/log/messages and sadly I no longer have the output, but it looked like there had been a serious problem: lots of memory locations printed in hex and presumably their contents (incomprehensible ramblings) on the right. It was very similar to the output in a log I found via Google while searching for a similar example, except that in the right-hand column most of the lines contained "ext4"; perhaps there was a file system error?

Running tail -n 50 /var/log/syslog I saw, in the middle of all the memory madness that was repeated there too, a couple of lines that said words to the effect of: INFO: task procname:pid blocked for more than 120 seconds.
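These watchdog lines are easy to pull out of the logs with grep. A minimal sketch, demonstrated here against a sample line rather than the real files (on the affected box you would point the same pattern at /var/log/syslog or /var/log/messages; the task name and PID are placeholders):

```shell
# Create a sample line in the style of the kernel's hung-task watchdog.
log=$(mktemp)
echo 'INFO: task procname:1234 blocked for more than 120 seconds.' > "$log"

# The same pattern works against /var/log/syslog or /var/log/messages.
grep 'blocked for more than' "$log"

rm -f "$log"
```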

I executed ps aux and looked through the output until I found one process with 299% CPU usage.

So this process had gone bonkers, it seems, but I couldn't execute any command (with or without sudo) related to memory, such as free -m or top. I could cat /proc/meminfo and see that about 5 GB out of 40 GB of RAM was free.
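A quick way to check whether a stuck process is actually running (R) or blocked inside the kernel (D) is to read its state straight out of /proc. A minimal sketch, using a background sleep as a stand-in for the runaway PID:

```shell
# Stand-in for the runaway process; substitute the real PID in practice.
sleep 30 &
pid=$!
sleep 1    # give it a moment to settle into its steady state

# Field 3 of /proc/PID/stat is the state:
#   R = running, S = interruptible sleep, D = uninterruptible sleep
state=$(awk '{print $3}' "/proc/$pid/stat")
echo "PID $pid state: $state"

# wchan names the kernel function the process is blocked in, if any.
cat "/proc/$pid/wchan"; echo

kill "$pid"
```

A process showing D here is waiting inside a system call and won't react to anything, including SIGKILL, until that call completes or is cancelled.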

I tried kill PID, but after a couple of minutes of hanging I gave up. I tried kill -9 PID, but again, same thing. I can only assume this process was so busy that it couldn't respond to signals from the kernel? I also tried renice 19 PID followed by kill -9 PID, but this didn't work either; renice would run, then just hang.

In the end a hard reboot was required, which was not ideal: files are now corrupt etc. due to the specialist applications on the server. What other options did I have?

Is there no way to simply cease a process? Rather than sending a SIGTERM, just flat-out stop it executing code, or similar?



If the process is in the running state (R), kill -9 kills it immediately; there's no such thing as "too busy". If the process is inside a system call (state D), kill -9 kills it (there's nothing stronger), but it can sometimes take a tiny amount of time for the system call to reach a cancellation point. If kill -9 takes a noticeable amount of time or doesn't work at all, it's a kernel bug, possibly triggered by a hardware failure.
– Gilles, Nov 11 '12 at 0:58
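The point that SIGKILL, unlike SIGTERM, cannot be caught or ignored can be demonstrated with a throwaway process. A small sketch:

```shell
# A process that ignores SIGTERM, standing in for a stubborn process.
sh -c 'trap "" TERM; sleep 60' &
pid=$!
sleep 1

kill "$pid"                  # SIGTERM: discarded by the trap
sleep 1
kill -0 "$pid" && echo "still alive after SIGTERM"

kill -9 "$pid"               # SIGKILL: cannot be caught or ignored
wait "$pid" 2>/dev/null || true
kill -0 "$pid" 2>/dev/null || echo "gone after SIGKILL"
```

If even this fails, the process is stuck in the kernel (state D), which is exactly the situation described in the question.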

One interesting detail here would have been to look at whether kswapd was at 100% CPU or not, because in 2012 there was a bug in 2.6-series kernels which, when the race condition was met, made kswapd sit at 100% CPU and allowed processes to hang, unkillable, at 100% CPU in the kernel too. An example script which was able to detect this state is at gist.github.com/hilbix/5264057
– Tino, Nov 7 '14 at 16:44

2 Answers

I executed tail -n 50 /var/log/messages and sadly I no longer have the
output but it looked like there had been a serious problem. Lots of
memory locations printed in hex and presumably their contents
(incomprehensible ramblings) on the right.

It could have been nearly anything, and the contents of those kernel dumps would be important in working out what it was.

For example, you could have had a hardware problem, like a disk that was no longer responding to requests. Trying to run programs that were already cached in RAM could work fine, while running programs that needed to read from the disk could hang.

It could also be that you hit a kernel bug, or some other driver problem, or had a bad bit flip in your RAM, or virtually any other hardware fault. If a driver locked a particular resource in the kernel and then hit a bug or error and failed to properly unlock it, any other driver or system call that tried to obtain that lock would simply hang.

It may not even be a bug in the kernel. You can get this sort of behaviour when, e.g., using the lvm or dmsetup tools to manage disks. Both can suspend a device, which has the result that "any further I/O to that device will be postponed for as long as the device is suspended". Programs that then try to access that device will simply block in the kernel. You could trigger this manually with "dmsetup suspend", and I've seen a disk left in a suspended state by accident when an LVM tool encountered an error.
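Processes blocked on a suspended device (or any stuck I/O) show up in state D, uninterruptible sleep. A minimal sketch for spotting them by scanning /proc, which on a healthy system should print nothing:

```shell
# List processes in uninterruptible sleep (state D), i.e. blocked in the kernel.
for d in /proc/[0-9]*; do
    state=$(awk '{print $3}' "$d/stat" 2>/dev/null)
    if [ "$state" = "D" ]; then
        printf '%s %s\n' "${d#/proc/}" "$(cat "$d/comm" 2>/dev/null)"
    fi
done
```

A steadily growing list of D-state processes all stuck on the same device is a strong hint that the device, not the processes, is the problem.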

If this is a one-time thing, don't sweat it. If it happens again, try to carefully note the kernel output so you can track down the cause; the first crash dump will be the most important. If it happens a lot and you can't capture the output, consider using netconsole to send the kernel output directly to another machine.
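For reference, netconsole is loaded as a kernel module with the source and target encoded in a module parameter. A sketch, where the ports, IP addresses, interface name and MAC below are all placeholders for your own network:

```shell
# On the crashing machine (requires root). Parameter format:
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# All addresses below are placeholders -- replace them with your own.
modprobe netconsole netconsole=6665@192.168.0.10/eth0,6666@192.168.0.20/00:11:22:33:44:55

# On the receiving machine, listen for the UDP kernel messages:
nc -u -l -p 6666    # traditional netcat; with OpenBSD nc: nc -u -l 6666
```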

Thanks for a great write-up. I'm thinking along the same lines as you, in that it could be a whole plethora of things, but thanks for the idea about netconsole. Also, your post has given me some homework to do on hardware faults etc. Thanks!
– jwbensley, Nov 11 '12 at 11:54

Maybe I'm confused about how it works, but a process "not having enough resources to respond to a signal" doesn't really make sense; it should be forced to handle it on its next timeslice, and it doesn't need to handle SIGKILL at all, which is kind of the point of that signal.
– Michael Mrozek♦, Nov 8 '12 at 15:52

"and just wait for it to get enough resources to process was the right answer": this was tried a couple of times; the longest I waited was about 3 minutes. I think at that point it's likely not going to work.
– jwbensley, Nov 11 '12 at 11:51

+1 for the OOM suggestion; that has given me some food for thought.
– jwbensley, Nov 11 '12 at 11:51