I have 15 identical Linux RHEL 4.7 64-bit servers. They run a clustered database (the clustering is at the application level). Every month or so, a random box freezes (never the same one twice).

The box still responds to ping. If I try to ssh into it, I get:

ssh_exchange_identification: Connection closed by remote host

SSH is set up properly.

When I go to the server room and try to log in directly at the console, I can switch consoles with Alt+Fn and enter a username (the characters do echo), but after pressing Enter nothing happens. I once waited 8 hours and nothing changed.

I set up syslog to log everything to a remote host, and there is nothing in those logs. When I reboot the machine, it works without a problem. I have run hardware tests and everything is OK; nothing shows up in the logs. The machines are also monitored with Nagios, and there is no unusual load or activity prior to a freeze.
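For reference, the remote logging is just the stock syslogd pointed at the central box; roughly this (with "loghost" standing in for our actual server name):

    # /etc/syslog.conf -- forward everything to the central syslog server
    *.*     @loghost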

What hardware tests did you run? What tools did you use?
– Tshepang, Feb 23 '11 at 12:08

The HW is HP ProLiant. I used HP's utility to check the RAID status (normal SMART tools do not work), and I used memtest to check the memory. I have been having this problem for several months, and it's never the same server.
– Luka Marinko, Feb 23 '11 at 12:26

4 Answers

It sounds like your kernel panicked in some way such that sshd couldn't send the server keys. Possibly the kernel was wedged so that the network stack was still up but the VFS layer was unavailable.

When I experienced similar problems on a RHEL4 system, I set up the netdump and netconsole services, plus a dedicated netdump/syslog server to catch the crash dumps and kernel panic output. I also set the kernel.panic sysctl to 10, so a panicked system reboots itself after 10 seconds. That way, when a system panics, you get both the kernel trace and a copy of that system's memory, which you can analyse with the 'crash' utility.
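Roughly, the client side looks like this (a sketch from memory; the variable names come from Red Hat's netdump/netconsole packages, and 192.168.1.50 stands in for the dedicated server):

    # send crash dumps to the netdump server
    echo 'NETDUMPADDR=192.168.1.50' >> /etc/sysconfig/netdump
    chkconfig netdump on && service netdump start

    # send kernel printk output (panic traces) to the remote syslog host
    echo 'SYSLOGADDR=192.168.1.50' >> /etc/sysconfig/netconsole
    chkconfig netconsole on && service netconsole start

    # reboot automatically 10 seconds after a panic
    echo 'kernel.panic = 10' >> /etc/sysctl.conf
    sysctl -p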

You would certainly also benefit from setting up a serial console on the hosts, so you can see console output and potentially hit the magic SysRq keys. Also, if you're willing to set up the networking and your hardware supports it, you can use IPMI to remotely power off, power on, restart, and query the hardware.
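For example (a sketch; the device, speed, and BMC address are placeholders, and ipmitool is assumed to be installed on the admin box):

    # /boot/grub/grub.conf -- append to the kernel line:
    #   console=tty0 console=ttyS0,115200n8

    # /etc/inittab -- run a getty on the serial port:
    #   S0:2345:respawn:/sbin/agetty ttyS0 115200 vt100
    # (add ttyS0 to /etc/securetty if root must log in there)

    # drive the BMC remotely with ipmitool
    ipmitool -I lan -H 192.168.1.60 -U admin chassis power status
    ipmitool -I lan -H 192.168.1.60 -U admin chassis power cycle
    ipmitool -I lan -H 192.168.1.60 -U admin sel list   # hardware event log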

(For what it's worth, RHEL5 has similar functionality with kexec/kdump, except that the crash dump is stored locally.)
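On RHEL5 that looks roughly like this (a sketch; the crashkernel reservation size varies by machine):

    # append to the kernel line in /boot/grub/grub.conf:
    #   crashkernel=128M@16M

    # enable the service and reboot; dumps land under /var/crash
    # by default (see /etc/kdump.conf for other targets)
    chkconfig kdump on
    service kdump start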

Hi, I have access to the console directly (via KVM), and there was nothing there. I could switch between virtual terminals and type in my username, but that's it. Also, Ctrl+Alt+Del did not work, but it should from the console.
– Luka Marinko, Feb 23 '11 at 18:00

Also, the servers have HP's iLO; I can reboot them and see the status of the HW remotely. There was no error there.
– Luka Marinko, Feb 23 '11 at 18:02

Did you check the syslogs during that time? It sounds like a panicked kernel. I don't trust KVMs on my Linux servers: too often the kernel panic doesn't show up on the console, or it's corrupted, or only the last couple of lines appear. That's why I prefer a serial console.
– jsbillings, Feb 23 '11 at 18:12


This does not sound like a kernel panic. Console switching still works and the login program is still active.
– mattdm, Feb 23 '11 at 19:40

Yes, I had syslog redirected to a central syslog server. There is nothing unusual in the logs.
– Luka Marinko, Feb 24 '11 at 11:00

The only time I've seen anything similar was where a KVM switch was used and a keyboard hot-key (e.g. Alt+N) was used to switch between servers. It didn't happen every time, and it was the server being switched away from that was affected, so it wasn't immediately noticeable. No lock-ups occurred if a physical button on the KVM switch itself was used to switch between servers. If the hot-key was used often, occasionally a server would stop allowing new logins. Existing SSH sessions were unaffected.

I will bet dollars to donuts that you are running out of memory. The system is grinding to a halt as it tries to work out where to reclaim some from. It may be happening so quickly that your monitoring doesn't catch it. I'd step up monitoring, including remote logging of memory usage, and check the logs for OOM-killer messages as well.
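A minimal sketch of both ideas (the cron tag and interval are arbitrary, and syslog is assumed to already forward to the central host):

    # /etc/cron.d/memwatch -- push memory usage into syslog every minute
    * * * * * root /usr/bin/logger -t memwatch "$(free -m | grep Mem)"

    # and check for OOM-killer traces in the forwarded logs
    grep -iE 'out of memory|oom-killer' /var/log/messages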

To me this sounds like the system has run out of some resource, so sshd cannot spawn the process needed to service the connection.

The actual bottleneck can vary (out of processes, out of memory), and the only way to be sure is to look at the logs and the console to see if anything shows up there. You may also want to pre-start an ssh session to each machine, simply to be prepared for the next time it happens; a sketch follows below.
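A minimal sketch, assuming hosts named node01..node15 and GNU screen on the admin box (both are assumptions; one plain terminal per host works just as well):

    # keep a detached, already-logged-in session open to every node
    for i in $(seq -w 1 15); do
        screen -dmS "pre-node$i" ssh "node$i"
    done
    # when a box wedges, reattach with e.g.: screen -r pre-node07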

If it is really bad, you may want to consider keeping a shell with more built-in commands running in those sessions, so you can investigate without having to start an extra process, as that may not be possible by then; see the sketch below. Also, "tail -f /var/log/*" may be very useful.
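For instance, sash (the "stand-alone shell") bundles common utilities as built-ins invoked with a leading dash, so they run inside the existing shell process with no fork/exec. It is a separate package rather than part of a stock RHEL4 install, so treat this as a sketch:

    # run this inside the pre-started ssh session, before trouble hits
    sash
    # built-ins execute in-process -- no new process needed
    -ls /var/log
    -cat /var/log/messages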