Currently, I'm the Moodle Admin for my University, and one of my responsibilities is to keep the server running and working all the time. Sometimes, for no apparent reason, Apache and/or MySQL crashes, causing total chaos within the University.

In a broad sense, what are the "basic guidelines" to follow when a server crashes? What should I do first to find out what happened? How do I know how many users were connected at the time of the crash (or at any given time)? How do I know how much memory or power I need for the current demand?

There are many questions related to each other, but these are the most important. Obviously, I'm far from being an experienced Sysadmin. I know my way around Linux a bit, if that helps.

Our server specs:

Intel Dual-Core Xeon @2.66 GHz (if I recall correctly)

2 GB RAM

500 GB HDD

CentOS 5.4

MySQL 5.0.45

PHP 5.3.12

EDIT: Sorry for the lack of information.

I have read both the Apache and MySQL logs, and nothing significant shows up. Apache's log is the most informative: it tells me WHEN the crash happened, but not why. In fact, what Apache logs for the crash isn't really an "error", just the entry for it restarting; the worst case is Apache reporting "SIGTERM" or "SIGKILL". The MySQL logs tell me absolutely nothing.

I usually try to follow what's happening using "top". When the crashes happen, it's rare for all (or even half) of the system memory to be consumed. In really dire situations, the CPU usage has reached... 80%?

Disk and memory usage seem fine (du and free show no problems). SSH access is usually fine. It just seems that MySQL or Apache crashes randomly, because it still hangs even when the demand is not that high.

The problem could be reduced to "What logs should I check?" and "How do I check the number of connections?"

2 Answers

Usually you want to start by looking at the system and application logs, which may or may not reveal something. If you have the sar tools running, you'll want to look at your system stats leading up to the crash.

Of course, it is always good to check for obvious things like a full disk, power interruptions, or recent user logins (maybe someone else typed 'reboot').
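As a rough illustration of the "disk full" check, here is a minimal Python sketch. The function names and the 90% threshold are made up for this example, and it assumes Python 3.3 or newer (for `shutil.disk_usage`), which the stock Python on an older CentOS release may not provide. The percentage is simply used/total, so it can differ slightly from what `df` reports.

```python
import shutil


def percent_used(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def nearly_full(paths, threshold=90.0):
    """Return (path, percent) pairs for filesystems at or above `threshold`."""
    flagged = []
    for path in paths:
        pct = percent_used(path)
        if pct >= threshold:
            flagged.append((path, round(pct, 1)))
    return flagged


if __name__ == "__main__":
    # The paths are only examples; list the mount points that matter to you,
    # such as the ones holding the MySQL data directory and the Apache logs.
    for path, pct in nearly_full(["/"]):
        print("WARNING: %s is %.1f%% full" % (path, pct))
```

Running something like this from cron and mailing yourself the output would at least rule the disk in or out the next time a crash happens.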

Depending on the crash, you may want to look at the console to see if there is any dump information still on the display.

If you can't find anything obvious in the logs, your next step might be to guess what you think the problem is and create some scripts to monitor that aspect of the system, so you can get more useful information in the future. If you think the number of connections may be the problem, then you may want to periodically collect the output of netstat or something.
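To make the netstat idea concrete, here is a minimal Python sketch, not a finished monitoring tool. It assumes `netstat -tan` is available and prints the usual columns (protocol, queues, local address, foreign address, state), and that Apache listens on port 80. The function names `count_states` and `snapshot` are invented for this example.

```python
import subprocess
from collections import Counter


def count_states(netstat_output, port=80):
    """Count TCP connection states for rows whose local port is `port`."""
    counts = Counter()
    suffix = ":%d" % port
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like:
        # tcp  0  0  192.0.2.10:80  198.51.100.7:51234  ESTABLISHED
        # Header lines do not start with "tcp" and are skipped.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_address, state = fields[3], fields[5]
        if local_address.endswith(suffix):
            counts[state] += 1
    return counts


def snapshot(port=80):
    """Run `netstat -tan` once and return per-state counts for `port`."""
    raw = subprocess.check_output(["netstat", "-tan"])
    return count_states(raw.decode("ascii", "replace"), port)
```

Calling `snapshot(80)` every minute from cron and appending the result, with a timestamp, to a file gives you a history to look back on after a crash. The ESTABLISHED count is a rough proxy for the number of clients connected to Apache at that moment; calling it with port 3306 would do the same for MySQL connections made over TCP.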

Excellent tip. I've never heard of that tool before. I should set up a cronjob using that tool or something and see what I can do. Netstat is something else I could use if filtered. I don't know how to read through the output, but it could tell me something - perhaps more than those MySQL logs.
– AeroCross, Jun 2 '11 at 19:52

Installing the sar package on most distributions will automatically set up a cronjob for you. On a Debian-based system just use apt-get install atsar.
– Zoredache, Jun 2 '11 at 20:41

Sadly, no. From what I could read, the /proc/sys/kernel/core_pattern file (the one that should say where that info gets dumped) just reads "/dev/null", and the other one (core_uses_pid) has "1" in it. I don't know where to go from there. I'll have to look around and see how to activate that dump file.
– AeroCross, Jun 2 '11 at 19:49