Linux – What to do when things go wrong

Linux is our bread and butter. The CLI is where we live. We live and die by our LAMP stack.

We’ve all had Linux server failures, and I want to remind our readers that when Linux fails, it is not irrecoverable. There are many reasons a server might fail, but as long as you follow some very simple rules, you will be able to get back up and running. All it takes is a little extra time, and some problem-solving skills.

If you’re reading this, then we can assume you already have the latter of the two.

Most Linux blogs or forums will tell you how to repair specific problems, but the real issue is how to think about fixing them.

These rules are from my own personal experience, and may not reflect your own environment. Take them with a grain of salt, and hopefully you will find something here that applies to your setup.

Rule #1 – Backup

This alone could be Rule 1, 2, and 3. Your backup is your savior. If you don’t back up regularly, you’re in for a world of hurt when the proverbial hits the fan. Take it from someone who knows. You don’t want to be caught with your pants down at crunch time.

If you have time to troubleshoot an issue, then you had time to set up regular backups in the beginning. Please, for the love of all that is holy – Backup.

Setting up a backup is a simple process, and you can use any method you think is appropriate. Using rsync, you can back up your data to a remote server. A small Bash script can have you remotely backing up your data over FTP quickly. Don’t forget that you can also use cron to do all the scheduled work.

One thing to remember is not to keep your backup on the same machine that is running live. If you do, and that machine fails, it will take the backups with it.
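As a rough sketch of the rsync-plus-cron approach, something like the script below could work. The source directory, the `backup@backup.example.com` host, the remote path, and the script location in the cron line are all made-up placeholders, so swap in your own.

```shell
#!/bin/sh
# Minimal backup sketch. SRC_DIR and DEST are hypothetical placeholders.
SRC_DIR="${SRC_DIR:-/var/www}"
DEST="${DEST:-backup@backup.example.com:/srv/backups/$(uname -n)}"

run_backup() {
    # -a keeps permissions, ownership and timestamps
    # -z compresses data in transit
    # --delete removes remote files that no longer exist locally
    rsync -az --delete "${SRC_DIR%/}/" "$DEST"
}

# Nothing is copied unless the script is called with --run.
if [ "${1:-}" = "--run" ]; then
    run_backup
fi

# Example crontab entry (assumes the script is saved as /usr/local/bin/backup.sh):
# 30 2 * * * /usr/local/bin/backup.sh --run
```

Because the destination is on another host, a failure of the live machine leaves the copies untouched.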

If you’re lazy, do a complete machine backup. This is easier in a virtualized environment, and restoring a machine with a couple of clicks may be preferable.

Rule #2 – Reboot

I cannot tell you how often this has gotten me out of a severe jam.

When all else fails, reboot your machine. It’s surprising how often this will successfully restart the services that have failed. A lot of services fail after being patched or upgraded. Usually, this can be attributed to the config files being altered in the new build, or to leftover cache files.

The installation script for most applications or services will usually take care of this, and restart the service with the new configs. On occasion, the script doesn’t, and the machine will need a quick reboot to make the switch.

A reboot also clears out whatever was resident in RAM, which can sometimes allow a service to restart successfully.
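If a reboot does not bring everything back, it helps to know which services are still down. The sketch below assumes a systemd-based distribution (an assumption, since nothing above names an init system) and lists the units systemd reports as failed. The function name `failed_units` is made up for the example.

```shell
#!/bin/sh
# Assumes systemd; on other init systems this will not work.
# failed_units is a hypothetical helper name.
failed_units() {
    # --failed lists units in the failed state; --plain and --no-legend
    # strip the status bullet, header and footer so only unit rows remain.
    systemctl --failed --plain --no-legend | awk '{ print $1 }'
}
```

Calling `failed_units` after a reboot prints one unit name per line, which gives you a short list of services to look at next.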

Rule #3 – Read the Linux logs

If your server has fallen over, and you don’t know where to start – The Logs.

Always the logs. This is where you should always start your troubleshooting process after getting to your command line. You will be like Sherlock Holmes, and the Logs are where all of your clues are held.

The clues will not always be obvious, but they are there. Buried in the Logs you can ferret out some very interesting information about what could potentially be wrong with your machine.

The Logs are laid out in a very logical fashion, and you should immediately be able to narrow down where to look. If you have an issue with Apache, you’re not going to need to look at the postfix log – yet.

Assuming that Apache is faulting, here’s a neat little trick. Run this in your terminal: `tail -f /var/log/apache2/error.log`

It will give you a live scrolling display of the Apache error log.

Now, all you need to do is replicate the error. You will be able to see the output in real time. This way, you can start to narrow down the issues, and get to fixin’!
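When the error has already happened and you cannot replicate it on demand, a quick look back through the same log can help. This is a small sketch; the function name `recent_errors` is hypothetical, and the default log path is the one from the Apache example above, which may differ on your distribution.

```shell
#!/bin/sh
# Print lines mentioning "error" from the last N lines of a log file.
# recent_errors is a made-up helper name; the default path is an
# assumption about where your Apache error log lives.
recent_errors() {
    logfile="${1:-/var/log/apache2/error.log}"
    lines="${2:-200}"
    tail -n "$lines" "$logfile" | grep -i 'error'
}
```

For example, `recent_errors /var/log/apache2/error.log 500` searches the last 500 lines of that file.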

Rule #4 – Google is your friend

I’ve been blasted for this in the past, but the reality is that there are too many applications and potential error messages for a single person to remember what every one of them means.

While Rule #3 may help us to find out what the error is, it doesn’t tell us what the error means.

The chances of someone else having had the exact error you have are extremely high. If not the exact error, then something very similar.

Using Google can really help us find out if an error is relevant, or just system log noise.

Search for the exact phrase of the error, and remember to strip any custom paths or hex values, as they are usually only relevant to your system, and will interfere with your search results.
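The stripping step can be done by hand, but a rough filter like the one below gives the idea. The function name `clean_error` is hypothetical, and the patterns are deliberately crude: anything that looks like a Unix path or a `0x` hex value is removed, which will occasionally eat something you wanted to keep.

```shell
#!/bin/sh
# Strip Unix-style paths and 0x... hex values from an error message read on
# stdin, so the remaining text is generic enough to search for.
# clean_error is a made-up helper name.
clean_error() {
    sed -E \
        -e 's#(/[[:alnum:]_.-]+)+/?##g' \
        -e 's/0x[0-9a-fA-F]+//g' \
        -e 's/ +/ /g' \
        -e 's/^ //' \
        -e 's/ $//'
}
```

Piping the made-up message `PHP Fatal error: Uncaught Error in /var/www/site/index.php at 0x7f3a2c` through `clean_error` leaves `PHP Fatal error: Uncaught Error in at`, which you can then trim and wrap in quotes for an exact-phrase search.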

Make sure that the sources you are getting results from are good quality. Sources like AskUbuntu and StackOverflow are usually good. Tech blogs can also be helpful. Avoid forums, unless you get no other results, or they are relevant to your issue (for example Apache forums for an issue with Apache).

Rule #5 – Reinstall

Before you freak out, I don’t mean reinstall your OS. What I’m pointing to is a quick reinstall of your application. This isn’t always a simple task, but if you’re using prebuilt binaries, a simple reinstall with your favorite package manager should be enough.
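On a Debian or Ubuntu machine (an assumption; the `/var/log/apache2` path used earlier suggests one, but your package manager may differ), a reinstall could look roughly like this. The helper name `reinstall_package` is made up for the example.

```shell
#!/bin/sh
# Reinstall a package with apt-get. Assumes a Debian/Ubuntu system with sudo.
# reinstall_package is a hypothetical helper name.
reinstall_package() {
    pkg="${1:-}"
    if [ -z "$pkg" ]; then
        echo "usage: reinstall_package <package>" >&2
        return 1
    fi
    # --reinstall reinstalls a package even if it is already at the newest version.
    sudo apt-get install --reinstall -y "$pkg"
}
```

For example, `reinstall_package apache2`. In keeping with Rule #1, take a copy of the application's config directory before you run it.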

3 thoughts on “Linux – What to do when things go wrong”

As well as searching for help I’ve found that the applications also have inbuilt tools to help you, so with Apache the ‘-t’ option reports on the syntax of the configuration, e.g. ‘httpd -t’ on Windows, which I imagine is similar on Linux.

Another thing I’ve learnt is only to use ‘su’ / ‘sudo’ when you really need to, as the root user can destroy a disk with a single command and make the machine unusable.

With configs another option is version control (such as Git) as they are text files, so as they change over time you can keep a ‘journal’ of the changes and easily go back to a known version that works and see the differences between versions.

Timeshift is essentially a GUI front end for rsync. If you have a machine that uses a DE (Desktop Environment), then it is definitely a valid option.

With Apache, you can also run `apache2 -f /etc/apache2/sites-enabled/000-default.conf` and you can get a live output. It’s not as simple as `tail`, and doesn’t output information that is as verbose, but it can be easier to remember.

`su` and `sudo` have the ability to do damage to your machine, but in the same way that a Windows `Administrator` has the ability to damage a Windows machine. They are roughly equivalent.

That’s a great post – and reminds even people like me (administering many Moodle sites daily) that there are different ‘levels’ of problems, and the ‘correct’ solution will vary tremendously from one situation to another. So glad you stressed backups … it almost makes me cry when I see people posting on Moodle.org and elsewhere that their Moodle site is broken, and they don’t have course / site backups. Honestly, if you can’t guarantee 100% that you can restore a server from a complete meltdown, you simply shouldn’t be the one in charge of hosting your Moodle site. It’s like driving across Australia, on your own, and not knowing how to change a flat tyre.