
What is your backup solution? Does it cover bare-metal restore? Is it tested? All of these things are more important than log watching.
– kmarsh, Sep 16 '09 at 12:05

7 Answers

My favourite two things that are easy to overlook and give you hell are:

Time. A bad clock can give you lots of problems. This is especially problematic with virtual machines.

Full disk. This can cause anything from strange errors to not being able to log in.
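To catch the full-disk case before it bites, something along these lines can run from cron. It is only a minimal sketch: the function name `check_disk_usage` and the 90% default are my own choices, and it assumes mount points without spaces. It reads `df -P` style output on stdin, so you can try it on canned input first.

```shell
#!/bin/sh
# Print every filesystem whose usage is at or above a threshold (percent).
# Expects POSIX `df -P` output on stdin: the 5th column is the usage
# percentage and the 6th is the mount point.
# check_disk_usage and the 90% default are illustrative, not a standard tool.
check_disk_usage() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "95%" -> "95"
            if (use + 0 >= limit) print $6 " is " use "% full"
        }'
}

# Usage: df -P | check_disk_usage 90
```

Running out of inodes gives the same symptoms as running out of blocks; `df -P -i` has the same column layout, so it can be piped through the same function.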

Keeping your system patched should be easy. I suggest you use "stable" distros with long-term support (meaning you get no updates to software beyond security patches and major bug fixes). This means that the package manager's 'update all' operation will probably be smooth and easy. You should also subscribe to the distro's security mailing list and evaluate all messages concerning software you have installed.

You should also audit all means of entrance to the box, make sure that there are no unnecessary net-accessible apps running and that necessary net-accessible apps are properly secured (i.e. use encryption as necessary and have strong authentication).
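For the audit, a quick first pass is to list what is actually listening on non-loopback addresses. A rough sketch, assuming the column layout of `ss -tln` from iproute2 (state in column 1, local address in column 4); the function name `list_exposed_listeners` is made up for this example.

```shell
#!/bin/sh
# Print the local address:port of every TCP listener that is not bound
# to loopback. Expects `ss -tln` output on stdin.
# list_exposed_listeners is a hypothetical helper name.
list_exposed_listeners() {
    awk '$1 == "LISTEN" &&
         index($4, "127.") != 1 &&
         index($4, "[::1]") != 1 { print $4 }'
}

# Usage: ss -tln | list_exposed_listeners
```

Anything in the output that you do not recognise, or do not need reachable from the network, is a candidate for disabling or firewalling. `ss -tlnp` run as root also shows the owning process.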

Does CentOS fit this bill? I currently use RHEL, but I wouldn't want to necessarily spend the extra money to license it if CentOS works just as well.
– Abraham Vegh, Sep 13 '09 at 16:28


Well, CentOS is nearly identical to RHEL, but updates arrive with a certain delay. Of course, you don't get the support, and other companies might not support it (e.g. Dell has 'unofficial' support). It depends. If you have the budget and want corporate support, go with RHEL (or whatever distro offers corporate support), but CentOS is quite OK. I'm a Debian person myself, although I have adopted CentOS in my company and on my VPS.
– alex, Sep 13 '09 at 18:42

Check your hardware. And keep checking your hardware while the server is running. Most Linux crashes are actually caused by hardware.

For each hard disk, use smartmontools to check its S.M.A.R.T. status. Before using the disk, run a long self-test (takes around 1 or 2 hours):

# This command starts the test
smartctl --test=long /dev/sda
# This one to show the test status
watch -n 120 smartctl --log=selftest /dev/sda

Also, keep the smartd daemon running and pay attention to the logs.
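Besides smartd, a periodic one-line health check is cheap to script. This is just a sketch: it assumes the ATA wording that `smartctl -H` prints ("SMART overall-health self-assessment test result: PASSED"); SCSI/SAS disks report health differently, and `smart_health_ok` is a name I made up.

```shell
#!/bin/sh
# Succeed (exit status 0) only if the `smartctl -H` report on stdin
# says the overall health self-assessment PASSED.
# Assumes ATA-style output; smart_health_ok is a hypothetical helper.
smart_health_ok() {
    grep -q 'self-assessment test result: PASSED'
}

# Usage (as root):
#   smartctl -H /dev/sda | smart_health_ok || echo "check /dev/sda"
```

A PASSED result is not a guarantee that the disk is fine, so this complements the self-tests and redundancy rather than replacing them.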

Use redundant disks. Don't trust only one hard disk. Not all errors can be caught by S.M.A.R.T., and I say this from my own experience.

Also install mcelog and keep watching /var/log/mcelog. This tool logs Machine Check Exceptions, which are exceptions caught by the CPU. A normal, healthy system should raise no MCEs. Thus, if you see any MCE logged, there is something wrong (maybe overheating).
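Since a healthy box should log no MCEs at all, the check can be as dumb as "is the log non-empty?". A minimal sketch, assuming mcelog writes to /var/log/mcelog and leaves the file empty when nothing has happened; `mce_logged` is an illustrative name.

```shell
#!/bin/sh
# Succeed (exit status 0) if the mcelog log file exists and is non-empty,
# i.e. at least one Machine Check Exception has been recorded.
# The default path comes from the text above; mce_logged is hypothetical.
mce_logged() {
    logfile="${1:-/var/log/mcelog}"
    [ -s "$logfile" ]
}

# Usage, e.g. from cron:
#   mce_logged && echo "MCEs logged on $(hostname), check /var/log/mcelog"
```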

Firstly, set up something like logwatch and apticron to email you log file reports and lists of out-of-date packages. Log in whenever apticron emails you and run apt-get update && apt-get upgrade (I'm assuming a Debian derivative; the other Linuxes probably have equivalents of apticron, though). If you know where you'll be connecting from, you should block ssh from anywhere but that IP. If not, installing something like fail2ban should stop brute-force attacks anyway.
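For the "block ssh from anywhere else" part, one way to do it is with two iptables rules. The sketch below only prints the rules so you can review them before applying anything; it assumes sshd on the default port 22 and an otherwise permissive INPUT chain, and both the function name `ssh_allow_only` and the address 203.0.113.5 (a documentation-only IP) are placeholders.

```shell
#!/bin/sh
# Print iptables rules that accept ssh (TCP port 22) from one source IP
# and drop it from everywhere else. Nothing is applied; the output is
# meant to be reviewed first. ssh_allow_only is a hypothetical helper.
ssh_allow_only() {
    ip="$1"
    # Order matters: the ACCEPT rule must come before the DROP rule.
    printf 'iptables -A INPUT -p tcp --dport 22 -s %s -j ACCEPT\n' "$ip"
    printf 'iptables -A INPUT -p tcp --dport 22 -j DROP\n'
}

# Usage: ssh_allow_only 203.0.113.5
```

Double-check the address before running the printed commands as root, ideally with console access at hand, since a typo here locks you out of the box.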