I have a server that had been running for well over 5 months when it suddenly stopped responding. I couldn't SSH into it or reach it any other way, so I decided to reboot it, and the reboot fixed it.

I'm trying to figure out what happened and I'm not sure exactly where to look. I started looking in /var/log, but there are tons of files in there and I'm not sure which ones I should pay attention to. I'm slowly going through each of them, but if anyone can point me in the right direction, that would be great.

3 Answers

I'd start with /var/log/messages, which is where most generic output goes by default. It includes boot messages and any kernel warnings. Depending on the type of issue, there may be no forensic data remaining. For example, failing RAM may not produce any logged errors, whereas disk errors will be in the logs.
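As a rough first pass over that file, something like the following Python sketch could pull out lines that often accompany kernel or disk trouble. The keyword list and the `suspicious_lines` helper are assumptions for illustration, not an authoritative or complete set of error signatures, so adjust them to your system.

```python
import re

# Hypothetical keyword list: phrases that commonly show up alongside
# kernel or disk trouble. This is an assumption, not an exhaustive set.
PATTERN = re.compile(
    r"out of memory|oom-killer|i/o error|kernel panic|segfault|call trace"
    r"|blocked for more than",
    re.IGNORECASE,
)

def suspicious_lines(lines):
    """Return the log lines (newline stripped) that match any keyword."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

To use it, open /var/log/messages (this usually needs root), pass the file object to `suspicious_lines`, and print the result. An empty result does not prove the system was healthy; it only means none of the assumed keywords appeared.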

SSH might simply have broken. Without knowing the status at the console, it's difficult to say definitively. Typically, an otherwise stable Linux box that hasn't been changed and suddenly locks up would indicate a hardware issue. Most hardware issues require further troubleshooting and diagnostics.

If you can provide more details, I will likely be able to give you further recommendations.

Hi Warner, I looked in /var/log/messages at the entries right before I rebooted the machine and there is nothing that would indicate something went wrong. I am running the server on Amazon EC2, so it's possible something broke on their side and my server was affected. I checked the free disk space and I am barely using 20%. Let me know what kind of details you need and I will do my best to provide them :) Thanks for your help!
–
Cerim Apr 24 '10 at 4:12

Amazon EC2 eliminates most potential hardware scenarios. I'd start looking at the daemons that run on the system: Apache logs, et cetera. It helps to have historical performance data; you might look at sar. Look for any anomalies, anything that looks out of place. Chances are it will be nearly impossible to isolate unless it recurs or you find evidence now.
–
Warner Apr 25 '10 at 2:17

It is possible, though, that an instance hangs if the machine on which my instance is running has a hardware failure. There is nothing in the log, and the fact that it never happened before and hasn't happened again (yet) leads me to believe it's more likely to be a hardware failure ... I'll monitor the instance closely and wait until it happens again.
–
Cerim Apr 25 '10 at 13:21

I have something like monit installed. All the things I was monitoring stopped. I block ping, so I couldn't ping it, but I also wasn't able to access SSH or any other services running on the machine. The more I think about it, the more I think it could be a hardware issue...
–
Cerim Apr 24 '10 at 6:50

It happened once. I have multiple servers running and only this one was affected. I've never had this issue before. The only output in /var/log/messages before the reboot was syslog-ng entries: syslog-ng[1774]: Log statistics ... syslog-ng[1774]: Log statistics ... shutdown[801]: shutting down for system reboot
–
Cerim Apr 25 '10 at 13:17
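When the only entries before a reboot are periodic ones like the syslog-ng statistics lines quoted above, one way to estimate when the machine stopped is to look for an unusually long gap between consecutive log timestamps. The sketch below assumes traditional syslog lines that begin with `Mon DD HH:MM:SS`; the `parse_ts` and `find_gaps` helpers and the 15-minute threshold are hypothetical choices for illustration, and the threshold should be set above whatever interval your periodic entries actually use.

```python
from datetime import datetime, timedelta

def parse_ts(line, year):
    """Parse a leading 'Mon DD HH:MM:SS' syslog timestamp, or return None.

    Traditional syslog lines carry no year, so the caller supplies one.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None

def find_gaps(lines, year, threshold=timedelta(minutes=15)):
    """Return (previous, current, gap) for consecutive timestamped lines
    that are further apart than the threshold (an assumed default)."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue  # skip lines without a recognizable timestamp
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps
```

Calling `find_gaps(open("/var/log/messages"), 2010)` would list every stretch where the log went quiet for longer than the threshold. A long silence ending at the shutdown entry suggests the system stopped logging well before the reboot, though it cannot say why.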