Thursday, December 17, 2009

Anyone who has never made a mistake has never tried anything new. -- Albert Einstein.

Here are a few mistakes that I made while working at UNIX prompt. Some mistakes caused me a good amount of downtime.

Most of these mistakes are from my early days as a UNIX admin.

userdel Command
The file /etc/deluser.conf was configured to remove the home directory (it was done by previous sys admin and it was my first day at work) and mail spool of the user to be removed.

I just wanted to remove the user account and I end up deleting everything (note -r was activated via deluser.conf):

# userdel foo

Rebooted Solaris Box
On Linux killall command kill processes by name (killall httpd). On Solaris it kill all active processes.

As root I killed all process, this was our main Oracle db box:

# killall process-name

Destroyed named.conf
I wanted to append a new zone to /var/named/chroot/etc/named.conf file., but end up running:

# ./mkzone example.com > /var/named/chroot/etc/named.conf

Destroyed Working Backups with Tar and Rsync (personal backups)
I had only one backup copy of my QT project and I just wanted to get a directory called functions. I end up deleting entire backup (note -c switch instead of -x):

# cd /mnt/bacupusbharddisk
# tar -zcvf project.tar.gz functions

I had no backup. Similarly I end up running rsync command and deleted all new files by overwriting files from backup set (now I’ve switched to rsnapshot)

# rsync -av -delete /dest /src

Again, I had no backup.

Deleted Apache DocumentRoot
I had sym links for my web server docroot (/home/httpd/http was symlinked to /www). I forgot about symlink issue. To save disk space, I ran rm -rf on http directory. Luckily, I had full working backup set.

Accidentally Changed Hostname and Triggered False Alarm
Accidentally changed the current hostname (I wanted to see current hostname settings) for one of our cluster node.

Within minutes I received an alert message on both mobile and email.

hostname foo.example.com

Public Network Interface Shutdown
I wanted to shutdown VPN interface eth0, but ended up shutting down eth1 while I was logged in via SSH:

# ifconfig eth1 down

Firewall Lockdown
I made changes to sshd_config and changed the ssh port number from 22 to 1022, but failed to update firewall rules.

After a quick kernel upgrade, I had rebooted the box. I had to call remote data center tech to reset firewall settings. (now I use firewall reset script to avoid lockdowns).

Typing UNIX Commands on Wrong Box
I wanted to shutdown my local Fedora desktop system, but I issued halt on remote server (I was logged into remote box via SSH):

# halt
# service httpd stop

Wrong CNAME DNS Entry
Created a wrong DNS CNAME entry in example.com zone file. The end result - a few visitors went to /dev/null:

Never use rsync with single backup directory. Create a snapshots using rsync or rsnapshots.

Use CVS to store configuration files.

Wait and read command line again before hitting the dam [Enter] key.

Use your well tested perl / shell scripts and open source configuration management software such as puppet, Cfengine or Chef to configure all servers. This also applies to day today jobs such as creating the users and so on.

Mistakes are the inevitable, so did you made any mistakes that have caused some sort of downtime? Please add them into the comments below.