So you have this neatly set up Unix server: it's super fast, works swell, and everything is great for months. Then suddenly all kinds of weird errors start showing up across a variety of services, and none of them make much sense on their own, much less together.

What are cheap things you should check as soon as you get your ssh session to the machine?

I'm especially interested in trauma stories that highlight non-obvious commands and rare situations, but I guess what's obvious varies from person to person, so we can just list them all freely.

12 Answers

If you can't log in, there are bigger problems afoot. These generally come in two flavors: hardware failure and software failure. Both are potentially catastrophic. To prevent DFA errors, check the general hardware health first; a simple glance-over will usually suffice.

Second Order: Are the system's underlying structures in good health and order?

Check the "Golden Triad" of systems:

Enough CPU time is free for processing

Enough disk space is free for storage

Enough memory is free for workloads

In the last few decades, the triad has expanded into a "quad" which includes communications (networking):

Connectivity is functional, responsive, and has capacity
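
The CPU, disk, and memory legs of the triad can be captured in one quick snapshot. What follows is a minimal Python sketch, not a substitute for top, df, or free. It assumes a Linux box with /proc/meminfo and a kernel recent enough to report MemAvailable; the function names are made up for illustration.

```python
import os
import shutil


def load_per_core(load_1min, cores):
    """Normalize the 1-minute load average by the number of cores."""
    return load_1min / max(cores, 1)


def disk_used_fraction(path="/"):
    """Fraction of the filesystem holding `path` that is in use.

    This can differ slightly from df's Use% column, which accounts
    for blocks reserved for root.
    """
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def parse_meminfo(text):
    """Turn /proc/meminfo text into {field: integer value} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def mem_available_fraction(meminfo):
    """Share of RAM the kernel estimates is available for new workloads.

    Raises KeyError on old kernels that do not report MemAvailable.
    """
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def snapshot(path="/"):
    """Collect the three numbers on the local machine (Linux only)."""
    load_1min, _, _ = os.getloadavg()
    with open("/proc/meminfo") as handle:
        meminfo = parse_meminfo(handle.read())
    return {
        "load_per_core": load_per_core(load_1min, os.cpu_count() or 1),
        "disk_used": disk_used_fraction(path),
        "mem_available": mem_available_fraction(meminfo),
    }
```

Calling print(snapshot()) on the affected machine gives three numbers to compare against whatever "normal" looks like for that box. A load per core well above 1 means work is queuing up.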

Third Order: What is the severity of the issue?

What programs or services are affected? In decreasing order of severity, is it systemic (system-wide), clustered (a group of programs), or isolated (a specific program)? Clusters of programs typically trip up because a specific underlying service has failed or gone unresponsive. Systemic issues are sometimes related to this (think DNS or IP conflicts), but knowing where to look is usually the key.

Fourth Order: Are diagnostic tools providing useful data relevant to the issue?

Now that you have info about the health of the system (second order) and what parts of it are experiencing issues (third order) this should make it easy to narrow down where the problem is.

Error messages or log files should be a common waypoint on this journey.

CPU issues:

uptime (for the load average)

top

strace

Disk space / I-O issues:

df

du

lsof

iostat

vmstat
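
One reason lsof belongs on a disk space list: a file that has been deleted but is still held open by a process keeps consuming space, so df reports a fuller disk than du can account for. The sketch below looks for such files by walking /proc directly. It assumes Linux, where the kernel appends " (deleted)" to the link target of such descriptors, and it needs root to see other users' processes. The function name is hypothetical.

```python
import os


def find_deleted_open_files(proc_root="/proc"):
    """Return sorted (pid, link target) pairs for open files that were deleted."""
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed between listdir and readlink
            if target.endswith(" (deleted)"):
                hits.append((int(entry), target))
    return sorted(hits)
```

If this turns up a huge log file held by a daemon, restarting or reloading that daemon is usually what releases the space.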

Memory issues:

free

Connectivity issues:

ping

route (and arp and rarp and friends)

iptables, ipchains, ipfw (for those BSD folks out there)

traceroute or mtr

host, nslookup, or dig (plus a look at /etc/hosts)

netstat
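
When the question is simply whether a dependent service answers at all, and how quickly, a timed TCP connect is a cheap complement to ping. This is a rough sketch using only the Python standard library; the function name and the 3-second default timeout are arbitrary choices, not anything from the tools above.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=3.0):
    """Return the seconds taken to open a TCP connection, or None on failure.

    Name resolution happens inside create_connection, so a DNS failure
    also yields None (socket.gaierror is a subclass of OSError).
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

For example, tcp_connect_time("127.0.0.1", 25) checks whether the local SMTP listener accepts connections. A result of None, or a time near the timeout, is a hint to move on to traceroute, dig, and the firewall rules.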

Most common complaint (that I hear):

Email is not being delivered fast enough (more than a minute from send to receipt by the recipient), or the mail server is rejecting my attempt to send. This usually comes down to the rate limiter in Postfix kicking in during a spam storm, which impacts its ability to accept internal mail for delivery.

A real-life example:

However, this is not always the case. One time, the issue persisted regardless of service restarts, so after 3 minutes it was time to start looking around. CPU was busy but under 100%, yet the load had soared to 15 on a box with just 2 cores and was threatening to go higher. The top command revealed that the mail system was in overdrive, along with the mail scanner, but there were no amavis child processes to be seen. That was the clue. The mail queue command (mailq) showed 150+ undelivered messages from the last 20 minutes, over 80% of which were spam. A quick adjustment to lower the rate limiter (which reduced the intake rate of the spam storm) while increasing the number of child email scanner processes (to help process the backlog), followed by a service restart, resolved the issue, and the system was able to complete deliveries in a short time.
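
Watching the queue depth is easy to script. The sketch below pulls the message count out of the summary line that Postfix's mailq prints at the end of its listing. The exact wording of that line ("-- N Kbytes in M Requests.") is an assumption about the Postfix version in use, and the function names are made up for illustration.

```python
import re
import subprocess


def parse_mailq_summary(output):
    """Return the number of queued messages, or None if no summary is found."""
    if "Mail queue is empty" in output:
        return 0
    match = re.search(
        r"^-- \d+ Kbytes in (\d+) Requests?\.", output, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def queued_messages():
    """Run mailq on the local machine and return the queue depth."""
    result = subprocess.run(
        ["mailq"], capture_output=True, text=True, check=False
    )
    return parse_mailq_summary(result.stdout)
```

Sampling queued_messages() a minute apart shows whether a backlog is draining or still growing, which is more telling than a single reading.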

The cause of the problem was that the amavis parent process had keeled over dead, and the child processes had eventually all run their course (they self-terminate after so many scans to prevent memory leaks). So there were SMTP processes in postfix attempting to contact...thin air...to do the spam/virus scanning that was needed. The distro I was using had out-of-date packages that would never be updated; as the installation was due to be replaced in a year or so, I manually "overrode" the install to the latest version, which included several bug fixes. I haven't had the same problem since.
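
A missing parent process like this is quick to detect once you know to look for it. Here is a minimal sketch that searches process command lines under /proc for a substring. It assumes Linux, and the "amavisd" search term in the usage note is an assumption about how the daemon shows up on a given install.

```python
import os


def find_processes(needle, proc_root="/proc"):
    """Return sorted (pid, command line) pairs whose command line contains needle."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                raw = handle.read()
        except OSError:
            continue  # process exited while we were scanning
        # Arguments in cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()
        if needle in cmdline:
            matches.append((int(entry), cmdline))
    return sorted(matches)
```

An empty result from find_processes("amavisd") while Postfix is still trying to hand mail to the scanner would have pointed straight at the dead parent.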