My 10 UNIX Command Line Mistakes

Anyone who has never made a mistake has never tried anything new. -- Albert Einstein.

Here are a few mistakes I made while working at the UNIX prompt. Some caused a good amount of downtime. Most of these mistakes are from my early days as a UNIX admin.

userdel Command

The file /etc/deluser.conf was configured to remove the home directory and mail spool of the user being removed (this was done by the previous sysadmin, and it was my first day at work). I just wanted to remove the user account, but I ended up deleting everything (note: -r was activated via deluser.conf):

userdel foo

Rebooted Solaris Box

On Linux the killall command kills processes by name (killall httpd). On Solaris it kills all active processes. As root I killed all processes on our main Oracle DB box:

killall process-name

Destroyed named.conf

I wanted to append a new zone to the /var/named/chroot/etc/named.conf file, but ended up overwriting it:

./mkzone example.com > /var/named/chroot/etc/named.conf

Destroyed Working Backups with Tar and Rsync (personal backups)

I had only one backup copy of my QT project, and I just wanted to extract a directory called functions. I ended up destroying the entire backup (note the -c switch instead of -x):

cd /mnt/backupusbharddisk
tar -zcvf project.tar.gz functions

I had no backup. Similarly, I once ran an rsync command that deleted all the new files by overwriting them with files from the backup set (I have since switched to rsnapshot):

rsync -av --delete /dest /src

Again, I had no backup.
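A habit that would have prevented the tar half of this: list the archive with -t before ever typing a create or extract against it. A minimal sketch in a scratch directory (all paths here are made up for the demo; on a real box you would point it at the backup disk):

```shell
#!/bin/sh
set -eu

# Scratch layout standing in for the real project and backup disk.
demo=$(mktemp -d)
cd "$demo"
mkdir -p project/functions
echo 'int main(void) { return 0; }' > project/functions/main.c

# Create the backup once with -c:
tar -czf project.tar.gz -C project functions

# Before typing tar against a precious archive, LIST it with -t:
# this prints the contents and fails loudly if the flags or file are wrong,
# whereas a mistyped -c silently overwrites the archive.
tar -tzf project.tar.gz

# Only then extract (-x), and only the directory you need:
mkdir restore
tar -xzf project.tar.gz -C restore functions
```

For the rsync half, rsync -n (--dry-run) previews exactly what --delete would remove before you commit to it.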

Deleted Apache DocumentRoot

I had a symlink for my web server docroot (/home/httpd/http was symlinked to /www). I forgot about the symlink. To save disk space, I ran rm -rf on the http directory. Luckily, I had a full working backup set.

Accidentally Changed Hostname and Triggered False Alarm

I accidentally changed the current hostname on one of our cluster nodes (I only wanted to see the current hostname setting). Within minutes I received alert messages on both mobile and email:

hostname foo.example.com

Public Network Interface Shutdown

I wanted to shut down the VPN interface eth0, but ended up shutting down eth1 while I was logged in via SSH:

ifconfig eth1 down

Firewall Lockdown

I made changes to sshd_config, changing the SSH port number from 22 to 1022, but failed to update the firewall rules. After a quick kernel upgrade, I rebooted the box. I had to call the remote data center tech to reset the firewall settings. (Now I use a firewall reset script to avoid lockdowns.)
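The idea behind such a reset script is: apply the risky change, then automatically revert unless you confirm you can still get in. Here is a minimal sketch of that confirm-or-rollback pattern; the "apply" and "rollback" steps are faked with a state file (on a real box they would be something like iptables-restore fed the new and the known-good rulesets):

```shell
#!/bin/sh
set -eu

# Stand-in for the live ruleset; a real script would use iptables-save/restore.
state=$(mktemp)
echo "known-good" > "$state"

apply_rules() { echo "risky-new"  > "$state"; }
rollback()    { echo "known-good" > "$state"; }

apply_rules

# Watchdog: if no confirmation appears within the grace period, revert.
confirm_flag=$(mktemp -u)   # name only; the file is created by an operator
( sleep 2; [ -f "$confirm_flag" ] || rollback ) &

# If we could still SSH in, we would run: touch "$confirm_flag".
# In this demo we never confirm, so the watchdog restores the old rules.
wait
cat "$state"   # prints: known-good
```

The same pattern works for any remote change that can lock you out: schedule the undo first, then cancel it once you have proven you still have access.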

Typing UNIX Commands on Wrong Box

I wanted to shut down my local Fedora desktop system, but I issued the commands on a remote server I was logged into via SSH:

halt
service httpd stop

Wrong CNAME DNS Entry

Created a wrong DNS CNAME entry in the example.com zone file. The end result: a few visitors went to /dev/null:

echo 'foo 86400 IN CNAME lb0.example.com' >> example.com && rndc reload
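A guard that catches this class of mistake: validate the zone file and only reload on success. With BIND that checker is named-checkzone; in this sketch the checker command is a parameter, so the pattern can be shown (and tested) without BIND installed:

```shell
#!/bin/sh
set -u

# reload_if_valid ZONE FILE CHECK_CMD
# Runs CHECK_CMD against the zone; only on success does it proceed to reload.
# With BIND this would be: reload_if_valid example.com db.example named-checkzone
reload_if_valid() {
    zone=$1 file=$2 check_cmd=$3
    if $check_cmd "$zone" "$file"; then
        echo "reload"                        # stand-in for: rndc reload
    else
        echo "refused: $file failed validation" >&2
        return 1
    fi
}
```

The same shape works for any "check before apply" pair: apachectl configtest before restarting Apache, visudo for sudoers, and so on.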

Failed To Update Postfix RBL Configuration

In 2006, ORDB went out of operation, but I failed to update my Postfix RBL settings. One day ORDB was re-activated, and it returned every IP address queried as being on its blacklist. The end result was a disaster.

Conclusion

All men make mistakes, but only wise men learn from their mistakes -- Winston Churchill.

Never use rsync with a single backup directory. Create snapshots using rsync or rsnapshot.
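The point of snapshots over a single mirrored directory: yesterday's tree must survive today's mistake. A minimal sketch of what rsnapshot automates, using cp -al so each snapshot is a cheap hard-link copy (GNU cp assumed; the directory names are invented):

```shell
#!/bin/sh
set -eu

root=$(mktemp -d)              # stands in for the backup disk
mkdir "$root/live" "$root/snaps"
echo "v1" > "$root/live/notes.txt"

# Snapshot: a hard-link tree, dated so no snapshot ever overwrites another.
snap="$root/snaps/$(date +%Y-%m-%d.%H%M%S)"
cp -al "$root/live" "$snap"

# A destructive mistake in the live tree...
rm "$root/live/notes.txt"

# ...is recoverable, because the snapshot still holds a link to the data:
cat "$snap/notes.txt"          # prints: v1
```

rsync can do the same across machines with --link-dest pointed at the previous snapshot. One caveat: an in-place edit (as opposed to a delete) modifies the shared inode, so it shows up in the snapshot too; rsync and rsnapshot avoid this by replacing changed files rather than writing into them.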

Use CVS to store configuration files.

Wait and read the command line again before hitting the damn [Enter] key.

Use well-tested Perl/shell scripts and open source configuration management software such as Puppet, Cfengine, or Chef to configure all servers. This also applies to day-to-day jobs such as creating users and so on.

Mistakes are inevitable, so have you made any mistakes that caused some sort of downtime? Please add them in the comments below.

My all time favorite mistake was a simple extra space:

cd /usr/lib
ls /tmp/foo/bar

I typed:

rm -rf /tmp/foo/bar/ *

instead of:

rm -rf /tmp/foo/bar/*

The system doesn't run very well without all of its libraries…

Yes – I think I've made almost every possible Linux mistake over the years. When I was a young sysadmin I did exactly what you did and put a space in the middle of rm -rf /stuff/to/delete/ *. I think now that the best thing is to use virtual machines, and back those VMs up locally and remotely. It's easy to restart a VM and roll back changes if needed.

* A VM is no silver bullet (and definitely no substitute for /dev/head and /proc/care);
* zsh can warn on this type of error, and its rm has an additional -s option to handle the "buried symlink" case;
* I've got a habit of hitting [Tab] (at least when zsh/bash3 is handy) and examining the static list instead of removing by pattern.

Michael, I don’t mean to be judgemental or start a discussion, but the idea behind this comments section (at least to my understanding) is to share experience and non-obvious mistakes in order to keep others from making them, not to discuss general ideas on how to do things like backup, etc. So please share more experience rather than correcting mistakes others have made. Cheers, Simon

I did something similar on my first day as a junior admin. As root, I copied my buddy’s dot files (.profile, etc.) from his home directory to mine because he had some cool customizations. He also had some scripts in a directory called .scripts/ that he wanted me to copy. I gave myself ownership of the dot files and the contents of the .scripts directory with this command:

cd ~jeff; chown -R jeff .*

It was only later that I realized that “.*” matched “.” and “..”, so my userid owned the entire machine… which happened to be our production Oracle database.

That was 15 years ago and we’ve both changed jobs a few times, but that friend reminds me of that mistake every time I see him.
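The ".*" trap above is easy to inspect safely before it bites: let the shell show you the expansion with echo. A sketch in a scratch directory (file names invented):

```shell
#!/bin/sh
set -eu
d=$(mktemp -d)
cd "$d"
touch .profile .bashrc regularfile

# Most shells expand .* to include "." and ".." -- which is why
# "chown -R jeff .*" climbed into the parent and took the whole machine:
echo .*

# Safer ways to mean "the dotfiles here, and nothing above me":
# chown -R jeff .     (a recursive "." covers dotfiles anyway), or:
find . -maxdepth 1 -name '.*' ! -name '.'
```

The find form is handy whenever a tool needs the dotfiles listed explicitly, because ".." can never enter its output.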

For most of these errors above that occurred in the workplace, perhaps the biggest mistake was that a senior admin or manager allowed some junior who does not know the difference between / and \ to type at a # root prompt on a valuable production server. I would not so much blame the junior, but I would suggest that the (ir)responsible senior should be fired! If my 3-year-old son strangles the cat on my watch, I'm responsible!

Recent issue: we switched all servers from PSU1 to backup power on PSU2 (all servers have redundant power units) in order to replace the main UPS with a higher model. However, the SAN switches do not have redundant PSUs. We watched the LUN paths fail over from the VMware vSphere Client, and they were up after switching power one by one. However, the storage for the main Oracle DB box didn't come back because of a Windows driver failure. Lesson learned: ALWAYS check that all LUNs are back online, for Windows and Linux separately.

I had taken over sysadmin duties on a server. The server had a cron job that ran as root, cd'ed into a directory, and did a find, removing any files older than 3 days. It was there to clean up the log files of some program they had. They quit using the program. About a year later, someone removed the directory. The cron job ran. The cd into the log file directory didn't work, but the cron job kept going. It was still in / – removing any files older than 3 days! I restored the filesystems and went home to get some sleep, thinking I would investigate the root cause after I had some rest. As soon as my head hit the pillow, the phone rang. "It did it again". The cron job had run again.
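The lesson in script form: a cleanup job must stop when its cd fails, otherwise the find runs wherever the shell happens to be (often /). A sketch with invented paths:

```shell
#!/bin/sh
set -eu

log_dir=$(mktemp -d)/old-logs   # a directory that does NOT exist any more
status=ok

cleanup() {
    # Bail out instead of carrying on in the old working directory.
    cd "$1" 2>/dev/null || return 1
    # Relative "." is now safe: it can only ever mean the log directory.
    find . -maxdepth 1 -type f -mtime +3 -exec rm -- {} \;
}

cleanup "$log_dir" || status="skipped: $log_dir is gone"
echo "$status"
```

The same applies inside plain `cd dir && rm ...` one-liners: the && is what keeps a failed cd from turning the rm loose somewhere else.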

Lastly, I once had an accidental copy & paste, which renamed (mv) /usr/lib. Did you know the “mv” command uses libraries in /usr/lib? I found that out the hard way when I discovered I could not move it back to its original pathname. Nor could I copy it (cp uses /usr/lib).

An “Ohnosecond” is defined as the period of time between when you hit enter and you realize what you just did.

My .. incident has taught me to hit tab just in case, to see what actually gets removed; BTW zsh is very helpful in that regard – it has some safety nets for the usual * and ~ cases. But then again, touching nothing with destructive tools when tired, especially as root, is a bitter but prudent decision.

Regarding /usr/lib: ALT Linux coreutils are built properly ;-) (although there are some leftovers, as we found when looking with some Gentoo guys at the LVEE conference)

The rm -rf is the most common mistake that ends up dooming Linux beginners. I ran into a problem where I needed to make sure the dates on the servers in an Oracle database network were at least 30 seconds apart. I forgot to put a "." at the end to represent the seconds, so the next day my servers had a date where the year was 2048. Even now my co-workers still call me Lightspeed.

Sounds like my mom's office (telecommunications). They're relentless. That's a pretty funny story (and a funny thing to call you; sure, it's at your expense, but it's not really that mean – it's more clever).

I think my favourite story is something that your (Jose's) story reminds me of. My mom fixes databases and other problems. One time there was a database issue she was debugging while talking to the person on the phone. The person kept reading the error he was getting to my mom over the phone. Finally, my mom realized that he had printed out the file itself – the file is what contained the error, not the printer! In other words, he printed out a file containing an error message and thought it was the printer having an issue. But hey, he learned from it, and we all make mistakes at times. Those who say you should just fire them don't realize that if that were the case they'd have no employees (and potentially lawsuits – yes, I'm serious; some firings over things that may seem harmless can lead to lawsuits, whether the case is well-founded or not).

I worked with a guy who always used "rm -rf" to delete anything, and he always logged in as root. Another worker set the stage for him by creating a file called "~" in a visible location (that would be a file entered as "\~", so as not to expand to the user's home directory). User one then dealt with that file with "rm -rf ~". This was when the root home directory was / and not something like /root. You got it.


This reminds me of when I told a friend a way to auto-logout on login (there are many ways, but this one is more obscure). He then told someone who was "annoying" him to try it on his shell. The end result was that this person was furious. Quite so. And although I don't find it so funny now (not as funny – I still think it's amusing), I found it hilarious then (hey, I was young and as obnoxious as can be!).

The command, for what it's worth:

echo 'PS1=`kill -9 0`' >> ~/.bash_profile

Yes, that's setting the prompt to run the command kill -9 0 upon sourcing of ~/.bash_profile, which kills that shell. Bad idea! (The single quotes matter: they stop the backticks from executing when the echo itself is run.)

I don't even remember what inspired me to think of that command, as this was years and years ago. However, it does bring up an important point:

Word to the wise: if you do not know what a command does, don't run it! Amazing how many fail that one…

Peter, for whatever reason I didn't see your response to my prank with regard to user profiles. I love the idea you mention too! I would never use this (or indeed what I told a friend 'in case' years ago) on anyone now, but it's still a fun thought to read. It also brings to mind the number of ways you can screw with users' heads, or screw your own – too many to count without getting bored. Ultimately, that is why we all – even those who are extremely cynical, like me – should always keep in mind that trust is dangerous and given too often. Thanks for sharing that. Yes, it would be very annoying, but probably less sadistic than what I ultimately caused (especially since any user with a clue would pick up on what is happening and know the obvious locations to check): the person who ran the command did it on a remote shell, thereby disabling their shell account. Lucky for them they were not the system's administrator, because anyone willing to run a command without knowing what it does would likely be logged in as root 'just in case', and then he'd have a real problem – since it was remote, he would not be able to rescue it by himself.

Incidentally, this whole topic (running commands without knowing what they do), while dangerous, can be good, as I describe below. Take for instance users meaning to do:

# last | grep reboot

but instead doing:

# last | reboot

when they should have just done (note the prompt change!):

$ last reboot

Had they not been root, or had they just typed last reboot (better still: both), they'd not have the problem. Still, as long as they learn from it (the earlier the better, and preferably before someone takes advantage of them with malicious intent), then it is – in my opinion – not a mistake but a learning opportunity!

Another good test is to first do "echo rm -rf /dir/whatever/*" to see the expansion of the glob and what will be deleted. I especially do this when writing loops; then I just pipe to bash once I know I've got it right.
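With invented file names, the two-step version of a destructive loop looks like this:

```shell
#!/bin/sh
set -eu
d=$(mktemp -d)
cd "$d"
touch app.log.1 app.log.2 precious.conf

# Step 1: echo prints the commands the loop WOULD run -- nothing is deleted.
for f in *.log.*; do echo rm -- "$f"; done

# Step 2: once the preview shows only the intended victims, pipe it to sh.
for f in *.log.*; do echo rm -- "$f"; done | sh

ls    # precious.conf survives
```

The preview costs one extra keystroke (recall the line and delete the echo, or append `| sh`), and it turns a glob surprise into a harmless printout.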

Rebooted the wrong box. While adding an alias to the main network interface I ended up changing the main IP address; the system froze right away and I had to call for a reboot. Instead of appending text to an Apache config file, I overwrote its contents. Firewall lockdown while changing the SSH port. Wrongfully ran a script containing a recursive chmod and chown as root on /, which caused me about 12 hours of downtime and a complete re-install.

Some mistakes are really silly, and when they happen you can't believe you actually did that. But every mistake, regardless of its silliness, should be a lesson learned. If you made a trivial mistake, you should not just overlook it; you have to think about the reasons that made you make it, like: you didn't have much sleep, or your mind was preoccupied with personal life, etc.

Using lpr from the command line, forgetting that I was logged in to a remote machine in another state. My print job contained sensitive information, which was now on a printer several hundred miles away! Fortunately, a friend intercepted the job and emailed me while I was trying to figure out what was wrong with my printer :-)

tar -czvf /path/to/file file_archive.tgz

instead of:

tar -czvf file_archive.tgz /path/to/file

I ended up destroying that file and had no backup, as this command was intended to provide the first backup. It was on the production Linux DHCP server, and the file was dhcpd.conf!

Funny thing, I don't remember ever typing in the wrong console. I think that's because I usually have the hostname right there. Fortunately, I don't do the same things over and over very much, which means I don't remember the command syntax for all but the most used commands.

Locking myself out while configuring the firewall – done, more than once. It wasn't really a CLI mistake, though. Just being a n00b.

georgesdev, good one. I usually:

ls -a /path/to/files

to double check the contents, then up arrow key, Home key, hit Del a few times, and type rm. I always get nervous with rm sitting at the prompt. I'll have to remember that -rf at the end of the line.

I always make mistakes making links. I can never remember the syntax. :/

I suspect you don't remember the syntax of 'ln' because it actually has four invocations. As a tip: in all invocations except the last (which uses option -t to specify the target directory), the target of the link comes first. But see the man page for more specifics (like the actual invocations themselves). And remember, if it is a symbolic link, the link you're creating should not already exist; if it does, you either remove it first or specify -f (ln won't overwrite it otherwise).

I think, though, that not remembering everything is just human nature, and that's the great thing about having such details in the man pages, which are easy to access (or, if you have info installed, even more detail in the documentation). Some would argue that if you can look it up, it's not a problem not to remember it; if you do remember it (e.g., from use over time), great, but if not there's no harm in looking at the documentation.
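The "target first" rule and the -f behaviour can be sketched in a scratch directory (invented names):

```shell
#!/bin/sh
set -eu
d=$(mktemp -d)
cd "$d"
echo "hello" > target.txt

ln -s target.txt link.txt        # TARGET first, link name second
cat link.txt                     # reads through the link: hello

# ln refuses to clobber an existing link...
ln -s target.txt link.txt 2>/dev/null || echo "refused without -f"
# ...unless you force it:
ln -sf target.txt link.txt
readlink link.txt                # prints: target.txt
```

A mnemonic that helps some people: the arguments are in the same order as cp – the existing thing first, the new name second.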

I wanted to remove the subfolder etc from the /usr/local/matlab/ directory. Out of force of habit (from going to the /etc folder) I accidentally added the '/' symbol, and from the /usr/local/matlab directory I typed:

sudo rm /etc

instead of:

sudo rm etc

Without the entire /etc folder the computer didn't work anymore (which was to be expected, of course), and I ended up reinstalling it.

I actually tried "rm -rf /etc" as root on my FreeBSD VM. Sure, you have to manually specify the file system to mount, and the computer can only boot into single-user mode, but it ran and booted otherwise. Similarly, I used ee to destroy /etc/passwd, and nothing happened ;) even after a reboot.

Yes, you can repair /etc/fstab without a file editor – and I'm not talking about restoring a backup, I mean directly repairing it, for example with stdin redirected to /etc/fstab via cat (I've done this, and I've remarked on it here). But you still need /etc/fstab for a normal system (I won't remark on a system with a single file system, because that's different, and even then you'll have problems at some point). And of course it will 'run' (see below) otherwise – and naturally only single user, as there isn't any other user! But I would argue there will be cases where that fails you too, with /etc completely gone (if nothing else, not much will work right[1]).

And as for /etc/passwd and "nothing happened": either it wasn't synced (not all file systems sync straight away, but that's just one thing), or you didn't delete or destroy it properly, or you didn't log in – in which case that is a way of saying nothing happened: nothing happened as a user, and any service that uses the setuid/seteuid system calls will fail to start (same with their group counterparts).

[1] Good luck with many things without proper time set up (/etc/localtime?). Anything that relies on it – and many things do, whether you believe it or not, whether you know of them or not – will be broken. Time is only one example. Rescue mode doesn't count, either. In short: booting does not mean functioning. If it did, all you'd need is a boot loader and something it can boot (keep in mind how small the boot sector is; at that level you could boot and then halt the CPU – it booted, but it is of little use now; or, instead of halting, trap the interrupts so that no input of any kind works, and the same with output if you like)… but that isn't all you need.

In short: it didn't run properly, and something that doesn't function properly should be considered broken. This is essentially what those who like workarounds are relying on: it doesn't work, it might have other issues you aren't aware of, and it is still broken. The other issues are many, and anyone who has enough programming experience, or who actually knows what faulty RAM can do, will understand this.

Deleted the wrong files. I used to place some files in /tmp/rama and some conf files at /home//httpd/conf, and I used to swap between these two directories with "cd -". I executed the command rm -fr ./*, which was supposed to remove the files at /tmp/rama/*, but it ended up removing the files at /home//httpd/conf/*, without any backup.

Conclusion: check which directory the rm command is about to remove files from.

On a SunOS/Solaris box, or anything with dynamically linked critical programs [cp, cat, tar, sh], if /dev/zero or /usr/lib/crt0.so vanishes, you’re screwed. /dev/zero is especially insidious, because, in their infinite wisdom, they decided to make mknod a dynamically linked executable, so you can’t even mknod it.

Sure there were. First thing: take an occasional backup before any modifications. And be careful when testing scripts with redirections to important commands; use an absolute path to the binary, just in case. Somehow I found the mail binary on a server of the same version.

Another nice thing to remember: do wait for the password prompt, especially when someone is around you. Sometimes there are a few seconds to wait, so that you don't type your password as clean, readable characters into the console.

My worst mistake was when I started using Ubuntu, changing abruptly (but willingly) from Windows to Linux. I accidentally deleted the entire filesystem with a command. No backups, but it was a clean install.

Great post! I did my share of system mishaps, killing servers in production, etc. The most embarrassing one was sending 70K users the wrong message. Or better yet, telling the CEO we had a major crisis, gathering up many people to solve it, and finding that it was nothing at all while all the management was standing in my cube.

I did something very similar. I was creating all of the user accounts on the new Samba domain controller that was going into production the next day. Everything was done and configured except creating the users. When creating the last user, instead of specifying her home directory as /home/staff/9807mr, I designated it as /home/staff. The only way to fix it was to reinstall. I had all of the configuration files backed up, but was still up all night recreating all the user accounts.

And the worst: running update and upgrade while some important applications were running, of course on a production server. As someone mentioned, the system doesn't run very well without all of its original libraries :)

Rebooted/halted the wrong servers – done (I posted a link on how to protect against this on Linux). Stopped the wrong interface, firewall lockup – done. Some fun examples: I wanted to delete all hidden files in a user's home, so rm -fr .* :-D Guess whether that matches . and ..? :) And the nastiest thing I did recently was to run reiserfsck on an LVM device! ROFL! Thank god it was on a testing server… neither did reiser check OK, nor did LVM work after that. :)

Yep, most of the time. But not with a [pkill] command. [pkill -v myprocess] will kill _any_ process you can kill — except those whose name contains "myprocess". Ooooops. :-! (I just wanted pkill to display "verbose" information when killing processes.)

I issued the following command on a BackOffice trading box in an attempt to clean out a user's directory, but issued it in /local. The command ended up taking out the application-mounted SAN directory and the /local directory.

Another guy wanted to see the FQDN on a Solaris box and executed "hostname -f". It changed the hostname to "-f", and clients faced a lot of connectivity issues due to this mistake. [hostname -f shows the FQDN on Linux, but on Solaris its usage is different.]

I was dragged into a meeting one day and forgot to secure my Solaris session. A colleague and former friend did this:

alias ll='/usr/sbin/shutdown -g5 -i5 "Bye bye Vince"'

He must have thought that I was logged into my personal host machine, not the company's cashcow server. That's what happens when it all goes wrong. Secure your session… Rgds, Vince

I run a periodic (daily) script on a BSD system to clean out a temp directory for joe (the editor). Anything older than a day gets wiped out. For some historical reason the temp directory sits in /usr/joe-cache rather than in, for instance, /usr/local/joe-cache, /var/joe-cache, or /tmp/joe-cache. The first version of the line in the script that does the deleting looked like this:

find /usr/joe-cache/.* -maxdepth 1 -mtime +1 -exec rm {} \;

Good thing the only files in /usr were two symlinks that were neither mission critical nor difficult to recreate, as the above also matches "/usr/joe-cache/..". In the above, the rather extraneous "-maxdepth 1" (joe doesn't save backup files in sub-directories) saved the entire /usr from being wiped out!
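A shape for that cron line with no ".*" glob at all, so "." and ".." can never enter the argument list (GNU find and touch assumed; the cache path is faked with a temp directory for the demo):

```shell
#!/bin/sh
set -eu

cache=$(mktemp -d)                    # stands in for /usr/joe-cache
touch "$cache/fresh"
touch -d '3 days ago' "$cache/stale"  # GNU touch: backdate a file for the demo

# Point find at the directory itself; -maxdepth 1 stays inside it, and
# -type f guarantees neither the directory nor anything above it is rm'd.
find "$cache" -maxdepth 1 -type f -mtime +1 -exec rm -- {} \;

ls "$cache"                           # only "fresh" remains
```

Handing find the directory as its starting point, rather than a shell glob, is what removes the ".." hazard: find never emits the parent of its starting point.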

One more mistake I do remember: copying a directory to another server without using the recursive option. That copied the files at the top level, but the files stored in the sub-folders were not copied.

My favorite mistake was sitting around waiting for 250 or so gigs of stuff to copy to an nfs share, only to find I forgot to mount the remote share and it all just copied into the directory in /mnt instead. Good thing I had a huge root partition…

… but did not press enter on the last line (as a joke). I expected them to come back, see it as a joke, ROFL… and hit backspace. The unthinkable happened: the screen went to sleep and they banged the enter key a couple of times to wake it up. We lost 3 days' worth of business and some new clients. Estimated cost: $50,000+.

Something similar here. I use the [Num Lock] or the [Ctrl] twice for this reason first…

I was wondering: is there any harm to these versions, other than the double [Ctrl] being mapped to console switching on some KVMs?

Plus one: there were a few good times routinely pushing [Ctrl+Alt+Del] on a virtual host console with plenty of Windows servers and a few Linux ones. You can bet on this. Very good trick to draw attention…

During my first job, on AIX, while saving a file with vim, it happened that sometimes you press another key after pressing w, so the file gets saved under a new name. Usually I simply delete these files and nothing more happens. But this is a task I have automated in my mind (rm -rf file).

I don't know how my fingers reached the star key, but once it happened that I saved the file as *.

Imagine what happened after I finished working on the script and went back to the shell to remove the file, and my automated 'rm -rf file' routine kicked in… my whole user directory was deleted…

I've done the wrong-server thing. SSH'd into the mail server to archive some old messages and clear up space. Mistake #1: I didn't log off when I was done, but simply minimized the terminal and kept working. Mistake #2: At the end of the day I opened what I thought was a local terminal and typed /sbin/shutdown -h now, thinking I was bringing down my laptop. The angry phone calls started less than a minute later. Thankfully, I just had to run to the server room and press power.

I never thought about using CVS to back up config files. After doing some really dumb things to files in /etc (deleting, stupid edits, etc.), I started creating a directory to hold original config files, renaming them to things like httpd.conf.orig or httpd.conf.091709.

As always, the best way to learn this operating system is to break it…however unintentionally.

Ohhh, I did this once in an LPIC certification class. I had my laptop running Ubuntu, but we all had an account on a SUSE box the instructor wanted us to use for the class, so I was logged into that as well via SSH. Two identical-looking terminal windows up… you can guess what I did. The worst part was that we had been working for nearly an hour and some people hadn't saved their files…

Attempting to update a Fedora box over the wire from Fedora 8 to Fedora 9, I updated the repositories to the Fedora 9 repos and ran:

# yum -y upgrade

I have now tested this on a couple of boxes, and without exception the upgrades failed, with many loose older-version packages, dozens of missing dependencies, and some fun circular dependencies which cannot be resolved. By the time it is done, eth0 is disabled and a reboot will not get to the kernel-choice stage.

> …lost all my code just before the deadline :(

Learned from a borked git rebase --interactive early on in a personal project repo: tar the workdir up (or copy it, be it with rsync, scp, cp, or mc, whatever), and only then continue with anything you've the tiniest bit of doubt about.

I do the same thing when I’m afraid of breaking a source code repository. Depending on context, sometimes “cp -al important-files important-backup” is enough and it’s a lot faster than creating a tarball (that command creates a tree of hard links to the files in the original tree, so if you edit a file the change is visible in both directories, but if you accidentally delete some files from the original directory you can still reach them from the backup directory).

Like Peko above, I too once ran pkill with the -v option and ended up killing everything else. This was on a very important enterprise production machine, and I was reminded the hard way to check the man pages before trying a new option.

I understand where pkill gets its -v functionality (from pgrep, and thus from grep), but honestly I don't see what use -v is for pkill. When do you really need to say "kill all processes except this one"? Seems reckless. Maybe 1 in a million times you'd use it properly, but most of the time people just get burned by it. I wrote to the author of pkill about this but never heard back. Oh well.

This is why I never use pkill; I always use something like "ps … | grep …" and, when it looks OK, append "| awk '{print $2}' | xargs kill". But, as a normal user, something like "pkill -v bash" might make perfect sense if you're sitting at the console (so you can't just switch to a different window) and have a background program rapidly filling your screen.

Worst thing that ever happened to me: our Oracle database runs some RDBMS jobs at midnight to clean out very old rows from various tables, along the lines of "delete from XXXX where last_access < sysdate-3650". One Sunday I installed NTP on all machines and made a start script that does an ntpdate first, then runs ntpd. Tested it:

$ date 010100002030; /etc/init.d/ntpd start; date

Worked great, the current time was OK.

$ date 010100002030; reboot

After the machine was back up I noticed I had forgotten the /etc/rc*.d symlinks. But I never thought of the database until a lot of people were very angry on Monday morning. Fortunately, there's an automated backup every Saturday.

Tried to lock down a folder by removing its permissions (chmod 000). As a beginner wanting to impress myself, I did:

# cd /folder
# chmod 000 .. -R

I used two dots instead of one, and of course the system changed the permissions of the parent folder, which is /. I ended up leaving home and going to the server to reset the permissions back to normal. I got lucky, because I had just done a dd to move the system from one HDD to another and hadn't deleted the old one yet :) And of course the classics: configuring the wrong box, firewall lockout :)

In a hurry to get a DB back up for a user, I had to parse through a nearly several-terabyte .tar.gz for the correct SQL dump file. Being the good sysadmin, I located it within an hour, and in my hurry to get the DB up for the client, who was on the phone the entire time:

mysql > dbdump.sql

Fortunately, I didn't sit and wait all that long before checking to make sure that the database size was increasing, and the client was on hold when I realized my error:

mysql > dbdump.sql

should be:

mysql < dbdump.sql

I had just sent stdout of the mysql CLI interface to a file named dbdump.sql. I had to re-retrieve the damn SQL dump file and start over! BAH! FOILED AGAIN!
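The direction of the angle bracket is the whole story, and it can be rehearsed without a database – here cat stands in for the mysql client, since the shell treats the redirection identically:

```shell
#!/bin/sh
set -eu
d=$(mktemp -d)
cd "$d"
echo "INSERT INTO t VALUES (1);" > dbdump.sql

# The mistake: ">" sends the client's stdout TO the file, truncating it
# before the client even starts (cat plays the part of mysql here).
cat > dbdump.sql < /dev/null
[ -s dbdump.sql ] || echo "dump truncated!"

# What was intended: "<" feeds the file to the client's stdin.
echo "INSERT INTO t VALUES (1);" > dbdump.sql
cat < dbdump.sql
```

Note that the shell opens and truncates the `>` target before running the command at all, which is why the dump was destroyed instantly, not gradually.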

The deadline was coming too close for comfort; I'd been working too-long hours for months.

We were developing a website, and I was in charge of the CGI scripts, which generated a lot of temporary files, so on pure routine I worked in "/var/www/web/" and entered "rm temp/*", which at some point I misspelled as "rm tmp/ *". I kind of wondered, in my overtired brain, why the delete took so long to finish; it should only have been about 20 small files.

The very next morning the paying client was to fly in and pay us a visit, and get a demonstration of the project.

P.S. Thanks to Subversion and files still open in Emacs buffers, I managed to get almost all the files back, and I had rewritten the missing ones before morning.

Had a .pl script to delete mails in .Spam directories older than X days, and didn't put in enough error checking. Some helpdesk guy provisioned a domain with a leading space in it, and the script rm'd the whole mailstore (rm -rf /mailstore/ domain.com/.Spam/*). 250K users, 500GB used. Hooray for a 1-day-old backup.

chown -R named:named /var/named when there was a proc filesystem mounted under /var/named/proc. Every running process on the system got chowned: /bin/bash, /usr/sbin/sshd and so on. Took hours of manual finds to fix.
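One way to avoid this, as a sketch (assuming GNU find; the path is the commenter's): `-xdev` keeps find on the filesystem it started on, so anything mounted below the tree — such as that proc filesystem — is never visited.

```shell
# Illustrative only; the real command would be run as root:
#   find /var/named -xdev -exec chown named:named {} +
# Tiny demo that -xdev still walks a normal single-filesystem tree:
d=$(mktemp -d)
mkdir "$d/sub"
touch "$d/sub/zone.db"
count=$(find "$d" -xdev -type f | wc -l)
echo "files visited: $count"
rm -r "$d"
```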

Second day on the job I rebooted Apache on the live web server, forgetting to first check the cert password. I was finally able to find it in an obscure doc file after about 30 minutes. The resulting firestorm of angry clients would have made Nero proud. I was very, very surprised to find out I still had a job after that debacle.

scp overwrites an existing file on the destination server. I just used the following command and soon realised it had replaced the “somefile” on that server!! scp somefile root@192.168.0.1:/root/

# cd /usr/local/bin # ls -l → that displayed some binaries that I didn't need or want. # cd .. # rm -Rf /bin — yeah, you guessed it: smoked the bin folder! The system wasn't happy after that. This is what happens when you are root and do something without reading the command before hitting [enter] late at night. First and last time …

A few days ago, to give myself root's permissions, I asked a colleague to do “sudo chmod 640 /etc/sudoers” on his Ubuntu box. Result: sudo not working at all, and root's password was unknown :/ Booting from a LiveCD saved our day. But I consider this behaviour of sudo in Ubuntu rather stupid.

I've done (just as quite a few other folks) chmod -R .* back then, and I still consider not reading manuals, not experimenting small-scale beforehand, and blaming a tool instead of myself when I'm at fault rather stupid. (However dumbed down Ubuntu might already be, suggesting breaking a controversial but still security-related tool even further isn't going to win someone IQ awards, eh?)

As Michael pointed out, the stupid thing is blaming a program for functionality when the person is at fault (please note I'm not calling you stupid, or even trying to; see next point). There's a reason permissions are the way they are. Something may seem stupid, but does that make it so?

For instance: you know why you need to be root to chown a file even if it belongs to you? Does breach of security mean anything to you? Because as I recall, that would (could) be the end result. And yes, this is very much related to permission issues.

So, in recovering a binary backup of a large MySQL database, produced by copying and tarballing ‘/var/lib/mysql’, I untarred it in /tmp and did the recovery without incident (at 2am, when it went down). Feeling rather pleased with myself for such a quick and successful recovery, I went to delete the ‘var’ directory in ‘/tmp’. I wanted to type: rm -rf var/

instead I typed: rm -rf /var

Unfortunately I didn't spot it for a while, and not until after did I realize that my on-site backups were stored in /var/backups … It was a truly miserable few days that followed while I pieced the box together from SVN and various other sources …

I chose to scan / instead of /home/user and ended up with a screwed apt, libs, and files missing from all over the place :D I luckily had --log=/home/user/scan.log and not console output, so I could restore the moved files one by one. Next time I'll use --copy instead of move, and never start with /.

These two happened at home; while working I learned a long time ago (SCO Unix times) to back up files before rm :D

Heh, these were great. I have many of the above.. my first was reboot …. Connection reset by peer. Unfortunately, I thought I was rebooting my desktop. Luckily, the performance test server I was on hadn't been running tests (normally they can take 24-72 hours to run)..

Symlinks… ack! I was cleaning up space and thought, weird.. I don't remember having a bunch of databases in this location.. rm -f * — unfortunately, it was a symlink to my /db slice, which DID have my databases. Friday afternoon fun.

I did something similar by being in the wrong directory… deleted all my MySQL binaries.

This was also after we had acquired a company, and the same thing had happened on one of their servers months before.. we never realized that, and the server had an issue one day… so we rebooted. MySQL had been running in memory for months, and upon reboot there was no more MySQL. Took us a while to figure that out because no one had thought the MySQL binaries were GONE! Luckily I wasn't the one who had deleted the binaries; I just got to witness the aftermath.

Remotely logged into a (Solaris) box at 3am. Made some changes that required a reboot. Being too lazy to even try to remember the difference between Solaris and Linux shutdown commands, I decided to use init. I typed init 0… No one at work to hit the power switch for me, so I had to make the 30-minute drive into work.

This one I chalked up to being a noob… I was on an X terminal connected to a Solaris machine. I wanted to reboot the terminal due to display problems… Instead of just powering off the terminal I typed reboot on the command line. I was logged in as root…

I have a habit of renaming config files I work on to the same file with a “~” at the end as a backup, so that I can roll back if I make a mistake; once all is well I just do rm *~. Trouble happened when I accidentally typed rm * ~ — and, as Murphy would have it, on a production Asterisk telephony server.

Not fully inserting a memory module on my home machine, which short-circuited my motherboard.

On several occasions I had to use an rdesktop session to a Windows machine and use PuTTY to connect to a box (yep.. I know it sounds weird ;-) ). Anyway.. text copied in Windows is stored differently than text copied in the shell. While changing a root passwd on a box (password copied using PuTTY), I just Ctrl-V'ed it and logged off. I had to go to the datacenter and boot into single-user mode to access the box again.

Using the same crappy setup, I copied some text in Windows and accidentally hit Ctrl-V in the PuTTY screen of the box I was logged into as root. The first word was halt; the last character was an Enter.

Configuring NAT on the wrong interface while connected through ssh.

Adding a new interface on a machine, I filled in the details of a home network in kudzu, which changed the default gateway to 192.168.1.1 on the main interface. I only checked the output of ifconfig, not the traffic or the gateway and DNS settings.

> While changing a root passwd on a box … do check that you can still access it while keeping a root shell open.

This applies to sudo reconfiguration, groups, uids/gids, upgrades(!), and to some extent to network interface configuration and firewalls.

I’d usually reconfigure iptables like this — under screen(1) of course:

[apply changes]; sleep 30; [rollback changes]

where “apply” might be iptables -A/-I (then rollback might be iptables -D or -F), or “service iptables restart” (with “service iptables stop” to let me back in). Sure, the particular solution depends on the existence of e.g. NAT rules to still access the system, but that's rather a nasty habit in itself.

If I press Enter after considering the changes being made and suddenly the screen stops responding, then I'll wait half a minute and hopefully get the console back to reconsider.
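The apply/wait/rollback shape described above, spelled out with harmless shell variables standing in for the iptables commands (the commented rule is illustrative, not from the original):

```shell
# If the new rule locks you out, you can't type the rollback -- so
# schedule it on the same line, *before* applying:
state=open
state=locked ; sleep 1 ; state=open   # apply ; grace period ; rollback
echo "state: $state"
# With real rules it might look like (run under screen/tmux, as root):
#   iptables -I INPUT 1 -p tcp --dport 22 -j DROP ; sleep 30 ; iptables -D INPUT 1
```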

I've definitely rebooted the wrong box, locked myself out with firewall rules, rm -rf'ed a huge portion of my system. I had my infant son bang on the keyboard of my SGI Indigo2 and somehow hit the right key combo to undo a couple of symlinks I had created for /usr (I had to delete them a couple of times in the process of creating them) AND clear the terminal/history, so I had no idea what was going on when I started getting errors. I had created the symlink a week prior, so it took me a while to figure out what I had to do to get the system operational again.

My best and most recent FUBAR was when I was backing up my system (I have horrible, HORRIBLE luck with backups, to the point that I don't bother doing them any more for the most part). I was using mondorescue, backing the files up to an NTFS partition I had mounted under /mondo. One backup wouldn't restore anything because of an apostrophe or single quote in one of the file names being backed up, so I removed the files causing the problem (not really a biggie) and did the backup, then formatted the drive as I had been planning… only to discover that I hadn't remounted the NTFS partition under /mondo as I had thought, and all 30+ GB of data was gone. I attempted recovery several times but it was just gone.

My personal favorite: a script somehow created a few dozen files in the /etc dir … all named ??somestring, so I promptly did rm -rf ??* … (at the point when I hit [enter] I remembered that ? is a wildcard … too late :)) Luckily that was my home box … but a reinstall was imminent :)

The extra space before a * is one I've done before, only the root cause was tab completion.

#rm /some/directory/FilesToBeDele[TAB]*

Thinking there were multiple files that began with FilesToBeDele. Instead, there was only one, and pressing tab put in the extra space. Luckily I was in my home dir, and there was a file with write-only permission, so rm paused to ask if I was sure. I hit ^C and wiped my brow. Of course the [TAB] is totally unnecessary in this instance, but my pinky is faster than my brain.
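Cheap insurance before any wildcard rm, sketched in a sandbox (file names invented): expand the glob with echo first and eyeball the result before handing it to rm.

```shell
# Preview the glob before deleting anything.
d=$(mktemp -d); cd "$d"
touch FilesToBeDeleted.1 keepme.c
preview=$(echo FilesToBeDele*)
echo "would remove: $preview"
# Only after checking the preview:  rm FilesToBeDele*
cd /; rm -r "$d"
```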

Tab completion is *so* handy, I love it. Back in the days, my zsh didn’t ask too many questions.

# rm -rf /etc/(something)[TAB][CR]

Note that ‘#’. Well, somehow the (something) part there got lost, and my fingers, of course, were faster than my optic nerves and brain. Lady luck was smiling on me that day: this happened on my own workstation. Try running without /etc; it's quite hilarious.

My own favorite was “chown -R userx:userx ../*” when I meant ./* in /opt/somdir. It recursed nicely; I cursed not so nicely. In my defence, I was trying to make sure I got the . files. It took many hours to straighten that mess out.

Another favorite was on day one of a new job. The local alpha-geek was hotdogging and ran a script that pushed a new user into /etc/passwd on all production servers. But the script had no error checking, and he ended up zeroing out /etc/passwd on every single one (30+ HP-UX boxes). It was like watching a slow-motion trainwreck. I felt much less intimidated after that ;)

In terms of that sinking feeling: I was telnet'd in to multiple production servers at multiple call centers (pre-SSH; yes, I'm that old). One server started circling the drain (known database ipc problems) and the only solution was a quick reboot before it locked up. I grabbed a window and ran shutdown. Of course it was the wrong window, so I took down 250+ people at a remote site and let the server lock up at my own site for another 250+ people. The remote site was bad enough, but after the hard power-off at my site I had to repair around 20 large ISAM database files, which took about two hours. Now I try to use a different background color for each server I connect to.

You can always modify /etc/bash.bashrc and add: alias rm="rm -i". Mine reads: alias rm="echo -------; echo Think before you delete...; echo Use yes on stdin or -f if you must bypass the prompt; echo You have been warned!; echo -------; rm -i"

It’s not the best fix, and there are better ones out there, but it works fine for home desktops.

You’re right, it isn’t the best fix. I would argue it isn’t a fix at all. There are several things to consider. And while the thread is about top 10… hopefully some of this adds value because I feel that listing mistakes is a good way to acknowledge them and learn from them.

Specify -i if you must, but do it at the command line. Relying on aliases is a disaster waiting to happen with something like this. Wait until you upgrade/install anew… or do something on a different system without that alias there(1)… your nanny won't be there to protect you, and then what? And yes, that is what it is doing: babysitting. I don't mean that as an offence, but it is still true. Get into the habit of using the tool properly, because that's the best way. Or don't, if you would rather be less efficient (and “safe”)…. not sure why anyone would want that, though. I guess it's similar to people not backing up. Yes, typos happen, but they can happen whether you use -i or not, as well as -f. (User error is yet another reason backups are important, and I'm sure you know this.)

[1] Don't even tell me you won't forget. You might not always forget, and you might forget some things and not others. But you will forget things, like all humans do. And depending on how long ago things were configured, you'll forget how many things (and rather, what) you have to reconfigure. But even then, if you rely on this ‘protection’ then you are only ‘safe’ on your systems that have it enabled. If you get into the habit of typing -i, then you are safe always (unless perhaps some implementation of rm just ignores invalid options, but that is another issue entirely). That is the most important thing to consider when specifying it at the command line (if you feel you need it). I should also point out that, while rare, botched updates do happen and can overwrite your configuration file(s).

…and it deleted the foo directory from disk :( … lost all my code just before the deadline :( Oh boy … I did the very same thing, leading to everlasting loss of dozens of TeX files …

Another one happened to me when I was going to create a cron job that deletes all files older than X days. So I was at the shell in the correct folder and tested it: find -mtime +23 -print -exec rm -f \{\} \;

Worked like a charm … so I put it just like that into the crontab and went to sleep. The next day I got an 8+ MB Logwatch mail … thousands of lines telling me some libs weren't found.

The bad thing: I didn't give a starting directory to find, since for testing I had cd'd to it … so the cron job started deleting from wherever it was: right away at / …. Lucky me, I had a second box that was almost identical … *phew*
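The crontab-safe form of that cleanup, sketched with invented paths: cron gives you no guaranteed working directory, so always hand find an absolute starting point (GNU find with no path defaults to ".", i.e. wherever cron happened to drop you).

```shell
# Always give find an absolute path when it runs from cron, e.g.:
#   find /var/tmp/myapp -mtime +23 -type f -exec rm -f {} +
# Sandbox demo with an explicit starting directory:
d=$(mktemp -d)
touch "$d/stale.tmp"
found=$(find "$d" -type f | wc -l)
echo "files under $d: $found"
rm -r "$d"
```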

But of course, the classics have also been done: wrong box, firewall lockout, halt instead of reboot, deleting the just-restored files from the production directory instead of the temporary location under /tmp … there was lots of fun. And I bet there is much more waiting to happen to me. ;)

One mistake I made was to run “slay” as an unprivileged user. By default (mean mode) this damn program will kill all your own processes in that case. It shouldn't be shipped like that in a serious distro like Debian or Ubuntu, but they ignored my complaint.

Needless to say this made the system useless. I'm not sure why I was even doing this when I could have done sudo cp /etc/passwd /etc/passwd.core. I think it's because I sometimes do sudo sh -c ‘head -25 /etc/passwd > /etc/passwd.core’. The lucky part was that this server was not in production yet.

Nice. Thank you for openly sharing. For all the “rm” errors: I've learned the hard way and have replaced “rm” via a bash alias with “rm -i”; that way I get a wakeup call as soon as I'm about to delete big-time. I have to type “/bin/rm” to bypass it.

# cat /dev/nul >/etc/motd came out for some reason as cat /dev/nul >/etc/passwd. GOK why. A senior moment near going-home time. Couldn't find any old passwd files around, so I had to invent one with root in it so I could log in at the console and extract the original from the backup tape. ps showed the users as numbers. No one noticed, but my boss went into a sweat when I told him the next day :) SCO box. Last millennium!

Hah! Hah!… the majority seem to have tortured “rm” to death. Surprisingly “rm” is still around… Anyway, my personal goof-up, about twenty years back, when there was no way of knowing where you were in the filesystem except by using “pwd”… no bash, no helpful PS1 configured to show your current location, etc. (Actually, after this episode I created a small script to show the current location.)

So here I was on a Xenix 286 system (apparently a Unix “developed”/supplied by Microsoft), in single-user mode, and thinking I was in “/tmp”, issued “rm -r *” (in those days /tmp was just a simple directory)…

Well! Instead of “/tmp” I was in “/” … the rest you can imagine… Also, in those days “rm” was much more powerful… I don't recall whether the “-f” switch had been invented yet, or for that matter “-i”. Or maybe I was unaware!… “rm” was pretty raw and did what it was asked to do, no questions asked!

But from that day I treat “rm” as my extra-marital wife…be real careful…

It is more tricky than that: I think the problem is many misunderstand the use of -f and its need. The common alias giving -i makes it more confusing to those unaware. No, it wasn't any more powerful. See the following to understand:

-r is all you need. Without -i you don't need -f, because there is nothing to override (-f is for something else, but it is also used to override -i). There is a reason scripts (build scripts like Makefiles, rpm scriptlets and so on) use -f, and on regular files too.

-f, --force ignore nonexistent files and arguments, never prompt

So -r is to recurse through directories, and without -i that is all you need. Put another way, -i makes it seem that -f is needed to remove directories, but actually -f is needed to override the interactive option itself. Yet another danger with that alias… Mind you, -f would be useful in some cases with rm -r, but it isn't that you need it to remove directories.

I had a mail system with an IMAP Maildir structure. For some reason a single Maildir was created under root, and the name of the user had a German umlaut in it, so the name was not /.foo but /.foaer& or something. (Maildirs have a dot in front of the name to mark them as Maildirs.) I copied the whole directory to the home directory and wanted to delete the directory under root — and typed rm -rf . foaer& (you see that little space…).

Now I had to do another job on the same machine: download a 300MB package. I used wget and waited until it had finished. But when half of the package was done, a line appeared behind the progress bar saying: already deleted. Now I got scared — because of the “&” as the last character in the directory name, and then in the command, rm -rf had been erasing the whole disk in the background….

My last backup was about two weeks old, so the whole company lost two weeks of mail, and I had a 24-hour job to completely set up the mail server again — and this happened Sunday night; my company started work at 06:00 in the morning. What a shitty night… But after this there was a budget for a new mail backup system^^

I was trying to add a backup route to a primary Sybase server and didn't include the “hop count”. Since it didn't work, I tried to “force” it and ended up deleting all routes from the routing table. :/

Done just about all of the above. The worst I have done is dropped the public interface on a production server at a conference center in the middle of DefCon 17 at the Riviera. Luckily there was IT on site to get the system rebooted within 5 minutes.

I had a USB drive with many folders of photos of my home in a main folder called “home”. I accidentally copied these as root, so the folder was owned by root. Later, on another computer, I went to this directory, noticed the permission problem, switched to root, copied the photos and mysteriously typed: rm -rf /home. In that situation that one “/” cost me a whole day's work. Daft mistake; ah well, live and learn.

I had a similar experience. I had an old machine with several users. I wanted to add another hard drive, make the /home partition there and move everything across. The problem came when I wanted to delete the old home directory to get more space back on the original disk. rm -rf /home deleted the new home directory's contents, including all the old home directory's contents. At the end I had 100% of nothing in the most important user's directory. No backup and no excuses at all. This happened 6 months ago, and I could have told myself (if I had bothered asking) that this would happen.

Depends on what you're after. You can use find (with the -exec option, or better yet piping it to xargs). You could also use a for loop. Too tired to give any examples (be thankful I am aware of this fact… that is the one that really gets me: you never know just how tired you are until it is too late, when you break a system or cause another ugly situation — programming when very tired is dangerous, and no, this is not just for beginners, not at all[1]). But the idea is that instead of using chmod directly you use it indirectly. Of course, you can also use it directly depending on how you invoke it (and again this depends on what you need). Perhaps even easier: if you use -R, be careful with what paths you use (hint: check ‘man -s 7 glob’ to understand how it all resolves).

[1] The only times I've made mistakes of any significance are when I was so tired I didn't know how tired, and how dangerous a situation, I was in. An example is causing memory corruption in a fairly large program I work on. Thankfully I am very good at debugging and otherwise solving problems. Still, it took far more time to fix than implementing the feature I was working on in the first place (close to a fortnight versus maybe a 5-minute change).

That’s saved me a few times since then. The .bashrc file also already had the “alias rm=’rm -i'” in there, which at first I hated, but learned to like it after it too saved my skin on more than one occasion.

Accidentally rebooted servers… best is uname -a before you hit the reboot command. Wrongly plugged a Sun M-series with 110V on one input and 220V on the other; the server wouldn't start whatever you did… Best backup is dump or ufsdump (Solaris); most of the time tar and cpio may lead you to lose your job.

I have screwed myself most ways mentioned. My shining moment was building a firewall for a local government site. They said port 1049 was experiencing lag, so to debug I decided to clear the firewall. INPUT and OUTPUT policies were set to reject, and I typed: iptables -X. The really bad part? It was a no-access VM; I had to call the host to chroot in and release it.
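A safer order for "clearing" a firewall over ssh, as a function sketch (needs root; these are the stock iptables chains and flags). Opening the default policies before flushing means a REJECT/DROP policy cannot cut your session the instant the ACCEPT rules vanish.

```shell
# Sketch only -- deliberately wrapped in a function and not called here.
open_then_flush() {
    iptables -P INPUT ACCEPT     # open the default policies first...
    iptables -P OUTPUT ACCEPT
    iptables -P FORWARD ACCEPT
    iptables -F                  # ...then it is safe to drop the rules
    iptables -X                  # ...and the emptied user-defined chains
}
```

Run something like this from the console, or combine it with the sleep-and-rollback trick mentioned earlier in the thread when working remotely.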

Too bad; you would need more or less serious hardware (for that mistake's scale) to handle a dozen active I/O+CPU jobs efficiently. Otherwise (e.g. on an older dual-socket or even a modern quad-core) it would rather increase seek contention and scheduling overhead, and thus total execution time.

This was in the age of laptops moving to having NO more floppy drives. I needed some way of removing GRUB as it was no longer needed. Since there were no floppy drives, I was searching for an alternative to the Win98 boot disk so I could run “fdisk /mbr”.

Found a nice little program that I burnt to a disc, which allowed me to modify the boot sector.

– Transferred the tar to my home directory on the new server and extracted it to get some config files: newbox> tar xf etc_oldbox.tar

The newbox was killed. Why? The native Sun tar doesn't remove the leading “/” by default, as GNU tar does. As a result, the whole of /etc was overwritten with files from another server. Linux people, be careful with Sun Solaris!
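A habit that would have caught this, sketched in a sandbox (archive and file names invented): list the tarball before extracting and check whether the member names are absolute. GNU tar strips the leading "/" on extraction unless you pass -P/--absolute-names; classic Solaris tar does not, and unpacks straight over /.

```shell
# Inspect member names before extracting an archive from another box.
d=$(mktemp -d); cd "$d"
mkdir etc; echo "backup" > etc/motd
tar cf etc_oldbox.tar etc              # relative archive for the demo
first=$(tar tf etc_oldbox.tar | head -1)
case $first in
    /*) kind=absolute ;;   # Solaris tar would unpack these over / !
    *)  kind=relative ;;
esac
echo "first member: $first ($kind paths)"
cd /; rm -r "$d"
```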

I'm calling nonsense on this. Either it was not a true CentOS install (which is discussed in their FAQ or some such: web and virtual private servers where the host company claims to run CentOS but in reality it is a mock-up at best) or it was a very early version without the protection it has now (and I somehow doubt that). Because the truth is yum will remove dependencies, and guess what yum relies on? Exactly: python. So you'll get something like this (just tried it, though I knew the result already):

Error: Trying to remove "yum", which is protected
You could try running: rpm -Va --nofiles --nodigest

Of course, that is not to say you couldn't remove python, but that command as given is not going to cause that in any way, certainly not for many years (2010 included, which is when Daniel posted the message, unless my eyes are tricking me).

Python is not the only package this protection applies to, either. Further, if it is a plugin (or setting), it is most certainly _not_ set that way by default. That would be beyond stupid. Of course, fixing it wouldn't be too difficult, assuming you have physical access, but that's another issue entirely.

I was stating that what was claimed to have happened is a fishy description. In other words: if a mistake is described and it isn't actually a mistake, then it is off topic (following your suggestion that this should only be about the topic, an actual mistake made at the command line, lest it become a “support thread”, albeit with a very strange definition of support). But I don't think it is off topic, because the point of their statement was that a command-line mistake broke the system, regardless of what truly happened. (Indeed, I could claim I did the firewall lockout one, but truthfully I have not… still, if I claimed it, would that be off topic since I didn't actually do it? Kind of like I wondered how he managed to pull off what is blocked by default…)

Shortly: my response was nothing of support, not even close, and neither was it off topic (it was in response to something on topic). Therefore, I wasn't turning the discussion anywhere at all (unless your idea of turning the discussion is in fact participating in it…). Matter of fact, I was in the prompt thread at one point trying to explain something to someone BUT I realised it was beginning to turn into helping rather than discussing, and I _personally_ stopped it and gave the reason. The only thing off topic is THIS (which I shouldn't have to explain)!

And if this sounds like me being grouchy, well, I cannot help that, and it doesn't change the fact that I did nothing wrong (no matter how you interpret it). If I misinterpreted your response, then I am sorry for that (but just); it was a really bad night last night. Bottom line: I was on topic.

This was years ago, and now I think back on what an idiot I was. I created a user, mike of course, on a box running DEC 4.0E, and I decided I wanted more permissions, so I made it user 0, just like root. The system actually asked if I was sure I wanted to do this. I said yes…. It took me 2 hours to get the system back up and running because it changed every file in the system owned by root to mike. Dumb, real dumb. Being superuser can be a disaster! I use dd a lot now and make many backups.

Stupid French AZERTY keyboards have the * key located just to the left of the ENTER key.

While typing mv /path/to/file /usr/local/ I pressed the ENTER key… pressing the * key at the same time :) It ended up executing: mv /path/to/file /usr/local/* — which moved every dir in /usr/local into the last directory in /usr/local.
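What the stray "*" actually did, replayed in a sandbox (directory names invented): when mv gets multiple sources, it treats the last argument as the destination directory, so everything the glob expanded to landed inside the final entry of /usr/local.

```shell
# Sandboxed replay of `mv file /usr/local/*`:
d=$(mktemp -d); cd "$d"
mkdir bin lib share       # pretend this is /usr/local
mv bin lib share          # last arg is a directory: bin and lib move into it
ok=no
[ -d share/bin ] && [ -d share/lib ] && ok=yes
echo "bin and lib now live inside share: $ok"
cd /; rm -r "$d"
```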

On at least three entirely separate occasions I've been on the command line within a C project I was writing and gone to remove backup files:

rm *~

but been distracted after typing the asterisk, then switched back to the terminal and immediately pressed enter, forgetting about the tilde. Immediately deleting weeks' worth of *.c files.

Luckily I found ext3undelete, and while I did recover my files, it painfully insisted upon recovering the entire partition, which of course happened to be the largest one. So I then spent the rest of the evening not only waiting for it to restore the files I actually wanted, but manually deleting everything it restored which I had *not* deleted in the first place, so that the partition I was restoring to did not run out of space before the files I had actually deleted were restored!

I already mentioned this here but it is worth mentioning again. Whether he meant last | grep reboot or not, what he _really_ meant (read: should have done) is:

$ last reboot — (Of course, yes, insisting on being root “in case” is a problem, as you point out, which is indeed why I used $. I could also have used %, but I hate csh; it's a joke, especially for C programmers like myself.)

“Best thing” happened to me. Tried to clone a server system after 12 hours of configuration, but typed: dd if=/dev/sad of=/dev/sda bs=1024. The work was done for nothing. The same thing even happened with the Windows Server 2008 backup tool. What have I learned? Check your syntax and read warnings when they appear … they are useful, at least for something …

I work as a student worker at a university, and as a web developer there I have some limited sudo access on the production web server. They upgraded to a new server and didn't warn me that they hadn't made bash the primary shell. So naturally I typed the same commands I always have to change ownership of a folder to myself so I could make some changes to files, and inevitably changed ownership of the whole server to myself! Oops!

We had a case of rm -rf / at work recently, but thankfully I was not the culprit. After that, they renamed the OPS team to OOPS.

My fav was a laptop which dual-booted Windows 2000 / Linux. I installed VMware under W2K with raw disk access, which allowed me to boot the Linux partition as a VM while running Windows. This worked great for about a year, until I gave the wrong partition number to mke2fs and formatted the NTFS as ext2 while W2K was running. W2K didn't even notice until its fs cache was cleared out by a reboot. After that it was, how shall I say, completely fscked.

Once I installed a Linux server and forgot to log out of the local console. I noticed this and foolishly decided to pkill all the user's processes. The user happened to be root, and one of its processes was sshd, thus meaning I locked myself out of the box.

The best one I have seen (and I had to rebuild the server afterwards) was someone's Q&D tidy-up script to clean old files out of /tmp…

cd /temp; find . -mtime +30 -exec rm -f {} \;

Practically everything on that line was fine… except it failed to complete the change of directory and then went its happy way from /, deleting anything that hadn't been changed in the last month, until it hit something important, at which point the server went down. It was run by cron as root in the wee small hours, so no one spotted the effect until the next morning.

The best thing about this, apart from it not being _me_ that executed the command: the management believed the line that it was a virus… on AIX, a decade ago.
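Two fixes for that /temp typo, sketched with invented paths: chain with && (or an if) so a failed cd aborts the whole line, and better still give find an absolute starting path so the working directory never matters.

```shell
# If the cd fails, find must not run.
d=$(mktemp -d)
if cd "$d/no-such-dir"; then
    find . -mtime +30 -exec rm -f {} +
    ran=yes
else
    ran=no       # the typo'd cd fails and nothing is deleted
fi
echo "find ran: $ran"
# cwd-proof form for the crontab:
#   find /tmp -xdev -mtime +30 -type f -exec rm -f {} +
rm -r "$d"
```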

Oh, I forgot the other really funny error… a customer who wouldn't use vi… copied /etc/passwd to a PC using Windows' built-in ftp, added a user to it in Notepad and then used ftp to put it back… and added a ^M to all of the entries!

First week with my new Ubuntu…
– Program: libc6 needs to be updated to run
– User: sudo apt-get remove libc6
– System: WARNING: DELETING THIS LIBRARY MAY AFFECT YOUR SYSTEM. TYPE ‘I’M AWARE OF THIS AND WANT TO CONTINUE’ IF YOU WANT TO PROCEED.
– User: types ‘I’M AWARE OF THIS AND WANT TO CONTINUE’ …
– System: *sigh*… ok… deleting everything that depends on libc (the whole system)
– User: eyes open, stomach twists, pushes “Power” button in panic
– User: restarts computer; nothing works. GRUB had problems, can't access the Windows installation. Doesn't have a Live CD. Has to ask friends for a Live CD, and explain the whole story every time.

Sounds like I know that one. A few years ago, as a newbie Linux admin, I tried to upgrade a very old Slackware router with a single IDE drive. I discovered that the openssl library had a reported bug and wanted to upgrade it. But the system said: first upgrade libc6 (or libc5, whatever). So I did. It FAILED. Rebooted the box (it had high uptime; nobody had rebooted it in a long time). The system didn't come up. Also, I saw on screen: IDE TIMEOUT… blah blah. The drive had died. I had a backup machine with a clean Debian installed on mirrored RAID, but it was still missing some things.. Learned how iptables works within 1.5 hrs :))) It happens.. almost exactly 28001 hours ago, because this server is still up and that is the hard drive's “Power on hours” attribute.

Can’t top the keystrokes, but think I’ve got a top 10 spot for consequences:

One of those ‘start rebuilding the wrong box in a failed redundant pair’ scenarios. Sadly, it was the dispatch note / purchase order processing system for the largest European lights-out central distribution center in a very big industry sector, and the config had been ‘evolved’ over time. Result: they couldn’t dispatch anything and had to turn away every delivery for a day and a half. We are talking about a hundred cross-continental trucks sent back to their depots empty, or still full. I could tell you that it wasn’t me, but I don’t expect anyone to believe that.

Moral: never, ever, ever trust a sticky label – look at the prompt, and if the prompt doesn’t tell you, change it so it does. Version-controlled (SVN’ed) config would have massively reduced the downtime too.

I was on a server, one of a dozen used for TRAINING airmen and soldiers on a military base – not for tracking weapons or shooting anything. The same training scenarios had been run for months. Logs (multiple hundreds of megabytes were collected per day) are kept for at least a year. I wanted to copy a day’s log collection and reset the log queue so the next day’s logs wouldn’t get too big for analysis (searching for unauthorized logons). I entered the command: # copy {logfile} {logfile.bak}; rm {logfile}; touch {newlogfile}; {start proven analysis script on logfile.bak};

Copy is a DOS command, not UNIX, so it did nothing. Rm worked. I had no more logfile for one day of training. I immediately and voluntarily told the information assurance team lead that I had accidentally deleted this one log file.

I was escorted off base within 15 minutes, no longer employed. This is a true story. I wish it weren’t. There is no humor in this event.

Lesson for my next job: the rm command is the enemy, especially when combined with idiot information assurance staff members. Avoid it. Use the mv command instead.
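A sketch of that mv-based approach – rotate by renaming, never by deleting (the filename is illustrative):

```shell
#!/bin/sh
# Rotate a log without ever invoking rm: the old data is renamed, not
# destroyed, so a slip costs nothing. The path is illustrative.
LOG=/tmp/demo-app.log
echo "old entries" > "$LOG"          # stand-in for the day's log
mv "$LOG" "$LOG.$(date +%Y%m%d)"     # keep the old log under a dated name
: > "$LOG"                           # start a fresh, empty log file
# analyze the dated copy at leisure; delete it only after verification
```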

Yes, I have done some excellent fat-fingers, ID10Ts, etc. More importantly, here is what I do now to help combat human error. My personal favorite: running /sbin/service network down and forgetting you are not on the console… D’oh!

1. Measure twice, cut once – look carefully at commands prior to running them; don’t ever hit Enter unless you really know what the command will do.
2. Build a practice – being a sysadmin is like being a doctor: think about how you do what you do, do it consistently, and document it, ideally in a wiki.
3. Take the wiki documentation you create and automate what you do in scripts, so life is easier and there are fewer mistakes, since you test your scripts on a lab machine.
4. NEVER TEST ON PRODUCTION.
5. Create build plans for anything complex; refer to your docs when you do it, and ideally test your docs on a lab or DEV server prior to doing it on PROD.
6. Play nicely with others.
7. Hire junior admins to do the junior stuff, freeing the senior admins to mentor and do the senior stuff.

Created a Perl script to fetch files from a remote server but forgot to add a check for directory existence. All of the remote files were lost because my scp command was overwriting them on the destination path.

I typed the three-finger salute on what I thought was a Windows server to log in via a KVM switch, but unfortunately I was on the ERP system under RHEL, and the Ctrl-Alt-Del trap wasn’t deactivated… the reboot took 30 minutes. It was during a work placement and I was alone in the I.T. department.

I loved this. I wish I could remember some of my specifics, but I’ll just have to add a general comment. It applies to anyone sitting at a console – and I learned them the hard way a long time ago.

Never test a script or program on a production server. It WILL burn you eventually. We’re human, and we make mistakes.

Test every file manipulation command by first listing the files that will be affected without actually modifying anything.
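For globs in particular, a generic sketch of that dry-run habit (run here in a throwaway directory):

```shell
cd "$(mktemp -d)"                  # scratch directory for a safe demo
touch a.log b.log notes.txt
echo rm *.log                      # prints: rm a.log b.log  (nothing deleted)
ls -ld *.log                       # or inspect the matches in detail
rm *.log                           # run for real only after reviewing the list
ls                                 # notes.txt survives
```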

Let your users know you’re taking the production system down well in advance, so they have plenty of time to prepare. Emergencies do happen, but you’d better be able to explain to the CEO why his administrative assistant’s work was suddenly interrupted, or why the presentation being given to a customer who just flew 12,000 miles to see it was interrupted.

Never run through the hallway like you’re racing to put out a fire – it scares people. Calmly act as though you have it all under control and use the extra time to THINK about what you can do to resolve the crisis. Never go to your boss with an ill-formed, by-the-seat-of-your-pants analysis. If you’re wrong, you’ll look like a reactionary or a Chicken Little. Take the time to think, analyze, and consult with peers. You’ll look more professional.

Dave, I agree completely, even the part about not scaring them LOL, but you did forget one thing:

Have a backout plan and TEST that plan on the test system prior to even thinking about touching the production system. When you are certain that plan B works it’s really easy to look calm while you stop to get a cup of coffee on the way to the data center!! Trust me on that one.

While cleaning up someone’s mistake of running this on a live production box:

rm -rf / home/[…]/file (note the space)

All that was left was an SSH connection and existing services – the box was maybe half operational, with no one onsite at the DC, while we restored files to it from another similar production machine with rsync. Went to restart a daemon with:

These aren’t Unix mistakes, but: I was backing up files from my old Macintosh (Mac OS 7.6), so I sent them via FTP, but the program I was using defaulted to text mode. The real mistake was not knowing about md5sum.

Also, I used TI-Connect to back up 2 years of BASIC programs from my TI-89. After backing up, I wiped my calculator. I wanted to move the backup archive to another directory, so I issued Cut from TI-Connect so I could Paste it into another. Paste didn’t work.

A perfect evening. At 3 AM I finished revising some last-minute changes on my local Ubuntu virtual box, pushed them to git, updated the dev server’s AWS instance, rechecked, and then, ready to sleep and shut down my Mac, I entered an “init 0” to shut down my local Ubuntu box… good night…

unfortunately it was not the ubuntu box i shut down, but the AWS instance.

As AWS instances can be deleted on termination, and shutting down this instance resulted in its termination, the complete dev server plus all the data was gone… No backup (the instance had an EBS volume attached to hold the db data, but a year ago we had switched from Postgres back to MySQL; for some pleasant reason, only Postgres was using the EBS…)

I was testing some changes to the database schema locally for a live remote database. After one of the tests, I was going to drop the local database and restore a previous version from a backup file. Unfortunately, I had both the local and remote database admin panels open in separate tabs in my browser, and I dropped the wrong (live) database.

I had made backups of the live database shortly before this, but there was an error in the dump and only half the database was there.

And the datacenter didn’t realize that we had wanted them to back up the database dir also…

Yesterday I installed RHEL 6 Beta and was logged into it from home via ssh. I had an X session active on the server, which I had left running when I left the office.

/home is an LV and I wanted to reduce it, but I could not, as it said the volume was in use. So I thought running killall5 would kill my active X session (as lsof showed only my X session was using /home) and would leave my ssh session from home running.

Haha – I ran killall5, and my PuTTY got disconnected; I can no longer connect to my server. I will be going to the office on Monday to start the ssh service.

Fortunately, this was a test system I had configured to play around with RHEL 6.

Maybe: “you can ruin a single backup directory by, e.g., rsyncing an empty dir onto it”. With snapshots (think cp -al), one has a bit more trouble destroying a backup – though replacing a heavily hardlinked file’s contents will still ruin its contents in all the hardlinked snapshots.

So disk-based backup is complemented really nicely by tape (modern tapes can take around 800 GB of uncompressed data at ~100 MB/s, and tape changers cost somewhat less than a fortune by now). Bacula is highly recommended and worth the time.

In my home directory, I usually have a couple of .torrent files named “[isohunt] foo.torrent”, “[isohunt] bar.torrent” and the like. Even if you’re sure you have many files starting with the same letters, don’t type $ rm \[iso*

In one case there was only one such file, so what I ran was essentially $ rm \[isohunt\] foo.torrent * instead of $ rm \[isohunt\]\ * – erasing my entire home directory (at least the files, not the subdirectories). What a shame.
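The quoting pitfall can be reproduced safely (the filenames mimic the ones above):

```shell
cd "$(mktemp -d)"                    # scratch directory for a safe demo
touch '[isohunt] foo.torrent' other-file
# WRONG: the unescaped space splits the name into two arguments,
# "[isohunt]" and "foo.torrent", and a stray * then matches everything:
#   rm \[isohunt\] foo.torrent *
# RIGHT: quote (or fully escape) the whole name as a single argument:
rm '[isohunt] foo.torrent'
ls                                   # other-file survives
```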

I had two similar PCs on which I had to install Ubuntu and similar software. I am a lazy boy ;), so I installed Ubuntu on one of them (let’s say BoxA), then inserted the HDD from the second one (let’s say BoxB) into BoxA and ran:

dd if=/dev/sda of=/dev/sdb bs=1M

to clone the HDDs. For some reason, the HDD with the working system was sdb after reboot. So as a result, instead of two working PCs, I got two clean HDDs.

Yes, Linux has quite a few weird (well, illogical) things here. You had one drive; it was assigned the sda device name. You added a second drive to clone the first onto, but when you booted the Linux live CD, the former sda device was assigned the name sdb, and the second drive became sda. Expect dd cloning to fail unless you know Linux is weird here – or, better, test which device is which. The same was (at least for quite some time) true of network interfaces: Linux names devices in order of discovery, only it reverses the order, as if it pushes devices onto a stack and assigns names later when popping them off. BSD is much better in this respect: it assigns device names according to their physical place in the hardware.
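One defence, assuming a reasonably modern Linux, is to address disks by identity rather than by discovery order. The serial numbers below are placeholders:

```shell
# /dev/sda and /dev/sdb can swap between boots, but the symlinks under
# /dev/disk/by-id encode the drive model and serial, which never change.
ls -l /dev/disk/by-id/ 2>/dev/null     # map serials to current sdX names
# cross-check sizes and models before any destructive command:
#   lsblk -o NAME,SIZE,MODEL,SERIAL
# then clone by identity (placeholders -- substitute your own drives):
#   dd if=/dev/disk/by-id/ata-SOURCE_SERIAL \
#      of=/dev/disk/by-id/ata-TARGET_SERIAL bs=1M
```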

Valery, please note (as was noted to me before) that this page is explicitly for recalling troubles, not for marketing one’s crap over someone else’s crap – especially when one blames Linux instead of one’s own clumsy hands (“krivye ruki”).

With FreeBSD in particular, you could end up with a kernel panic *just* by going over (IIRC) 7 of vinum’s software RAID volumes back in the day, or by plugging a USB storage device into a router not so long ago.

Just in case: I did mess with SATA drives too, by plugging “the next one” into a socket with a lower number (that’s pretty easy when one has 6 or 8 of them, but quite feasible with even 4 or 2 when there’s no decent light source, you see). *BUT* I haven’t reversed the dd if/of so far, thanks to a habit of “fdisk -l” and other double-checking before things like that.

One of the fundamental mistakes a *NIX admin can do is listen carelessly for some local “authority” who would “back” their words by being loud and proud, and not by being actually knowledgeable and reasonable (the test is “why?”). That’s pretty wide-spread in Russian-speaking *BSD circles, unfortunately.

Funny you’d say that Michael. I actually came to this thread for this same reason, but got caught in some other messages I wanted to reply to.

Indeed, there’s such a thing as labels, UUIDs and so on. Switching to a different OS because you don’t know how to fix a problem, when the knowledge is readily available, seems silly to me. All OSes have their own flaws and strengths. So, for that matter, does each human.

But that said – I used to love BSD and scoff at Linux. I was even called an elitist by some, and frankly I do not blame them; with my attitude combined with really extreme sarcasm, I did seem like one! I think one of the biggest things I learned (after years of blindness) is to use what works for you! As it happens, I much prefer GNU libc to BSD’s much less useful libc (maybe improved by now?). And this is coming from someone who used to think that, say, C++ is bloated (more specifically, OOP is). My reasoning, or even my “defence”, was that even Linus says it is. A silly and stupid logic there (more like no logic) – I knew that much (it was more of a ‘well, if he sees it…’). However, as a friend said to me: you mean the person who created a very bloated OS? It’s true, Linux has a lot of stuff that a lot of people will never use. That doesn’t make it bad or useless.

The point was well taken, and I now love OOP (it’s very beneficial, it’s more type-safe [in C++ versus, say, C], and really it gave me something new to learn!).

In other words: Linux has flaws. BSD has flaws. Windows has flaws. MacOS has flaws.. NOTHING is perfect. Use what works for you.

Many of my mistakes could have been less harmful to me – and others – if I had learned earlier that knowing the *weak* sides of what is available and avoiding them is way better than knowing the strong sides and just relying on them…

Thanks Vivek, it was a decent idea for a blog post to share the bumps earned and prompt us colleagues to do the same ;-)

Wow, I didn’t mean to cause an explosion, sorry everybody… And compared to you, Michael, kernel expert, I’m just a humble sysadmin. I only put my dirty hands into the kernel once, when we had a 32-bit box with 8 GB of RAM: to get more than 1 GB for user data. I do use Linux a lot, as it’s something that just works for me. Way back, I remember 3-year uptimes on some of our Linux boxes. A couple of years back I started to seriously look at better alternatives, when Linux became more like Windows: every 1.5 months on average, a kernel security update (== reboot)… Respectfully, – Humble Sysadmin.

I don’t mind if moderator deletes my posts: I agree with Michael, my posts are just junk compared to elegant and instructive shell errors found here.

One thing I couldn’t buy as ultimately devastating, though: rm -rf / – if I ever manage to run it as root on my *nix box, I expect /bin, /boot, and part of /dev gone (and whatever else comes alphabetically before /dev in / on that box). Then the device hosting the root filesystem will be deleted, and that will be the end of my trouble. The rest – /home, /lib, /lib64, /sbin, /tmp, /usr, /var – will stay intact. Other opinions?

Just try that (not exactly that command but you’ll figure it out) in a virtual machine, then think of mmap’ed files, open handles, cached filesystem metadata.

On the bright side, on Linux at least one can salvage a (wrongly) deleted file at times by knowing it is still open by a running process, SIGSTOPping that process just in case, examining /proc/THAT_PID/fd/ symlinks and cat(1) the contents of the needed one, conveniently marked as “(deleted)”, into a safe place.

I’m not a kernel guy either – I only fixed iso9660 perms back in 1999 or so, for localhost :) – but I know FreeBSD FUD when I see it, and if you failed to find a distro that works for you (like I did in 2001), don’t blame “Linux” for it – that’s just not professional in the first place. (As we say in Russian: no point in swallowing that nonsense and then spreading it further.)

Back to topic: on Linux it might be safer to check /proc/mounts and not `mount` in case one has to doublecheck: I once had a trouble with a recently cloned hard drive after a reboot, blasting the wrong one with dd(1) after having missed the *real* mounts state (don’t remember the details but LABEL and UUID won’t help to differentiate between bit-per-bit copies, obviously). There’s a tendency to have /etc/mtab just symlinked to /proc/mounts though.

Re: “On the bright side, on Linux at least one can salvage a (wrongly) deleted file at times by knowing it is still open by a running process, SIGSTOPping that process just in case, examining /proc/THAT_PID/fd/ symlinks and cat(1) the contents of the needed one, conveniently marked as “(deleted)”, into a safe place.”

Actually, it isn’t necessarily Linux itself – at least the part about files being “deleted” but still “existing”. That’s an inode thing. Indeed, more than one process can hold a reference to the same file. That’s why deleting a file (e.g. with unlink(2), the C system call) is not necessarily a complete deletion; see that man page for more details. Similarly, moving a file (as with mv) keeps its inode, while cp uses a new inode. This is handy when, for example, a Makefile does mv -f outputfile outputfile.bak (or some such) right before linking the objects into the binary: if outputfile is running, you won’t cause problems, because you only changed the name (put another way, with Linux’s procfs, /proc/PID – where PID is of course the pid of the running program – will contain the same information as before the mv).

As for salvaging files, again, see what I wrote about checking the man page for unlink (but again section 2!). Further, another way of checking (e.g., under Linux) for a file or any file in fact, that is open by a process but is deleted:

$ lsof | grep deleted (as non-root you’ll likely get permission errors, but if so desired you can run it as root). You’ll see files that are deleted (indeed it shows “(deleted)”) with that command. With Linux’s procfs you’ll notice it under /proc/PID, as you refer to. I haven’t used BSD or any other UNIX in far too long to really remark on it, except that the inode is nothing specific to Linux (that, or I’ve really forgotten some things…).

Yes, this was somewhat off topic, but since deleted files and salvaging them were mentioned, I thought I would elaborate on why and _how_ that is possible and what is truly happening.
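The whole salvage trick can be sketched end to end (Linux procfs assumed; the sleep process is a stand-in for whatever real process still holds the file open):

```shell
#!/bin/sh
f=/tmp/salvage-demo.txt
echo "important data" > "$f"
sleep 30 < "$f" &            # a process holding the file open on fd 0
pid=$!
rm "$f"                      # unlinked, but the inode lives while fd 0 is open
ls -l /proc/$pid/fd/         # shows: 0 -> /tmp/salvage-demo.txt (deleted)
cat /proc/$pid/fd/0 > /tmp/salvaged.txt   # copy the contents back out
kill $pid
```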

Isn’t that a good thing – I mean, not having to deal with Windows? I would think so. Humour aside: I’m not sure 99% is correct, now or in 2010. But certainly – not counting servers – it’s the majority, and depending on your setup I could see that being a problem indeed.

Heh – fixing what ain’t broken (admin-wise), and then postponing the “how *exactly* can I start what I’m trying to stop now”. It’s like speeding through crossroads: you might make it 90% of the time, but you die the tenth time…

Having a lot of servers from one provider, all with similar hostnames, and reinstalling the OS of the wrong server.

Rushed setting up a server. Normally I set up an alias for cp so that it runs cp -R. I needed to back up an SQL database, so I went into the MySQL data directory and ran cp database_name /root/db_backup/database_name. I proceeded to run killall mysqld and then rm -rf database_name. Rebooted the server; SQL came up and all seemed fine, so: killall mysqld and cp /home/db_backup/database_name /var/lib/mysql/. Brought MySQL back up and tried to hit the website. Realised that I had only copied the files and no directories – the database was incomplete and the site was destroyed. No other backups.
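The core of that failure is that plain cp skips directories, and a MySQL database is a directory of table files. A sketch (the MySQL paths in the comments are illustrative; the demonstration uses throwaway files):

```shell
# cp without -R/-a refuses directories, so "cp database_name dest/" copies
# nothing useful. -a recurses and preserves ownership and permissions:
#   cp -a /var/lib/mysql/database_name /root/db_backup/
# better still, a logical dump that doesn't depend on file layout:
#   mysqldump database_name > /root/db_backup/database_name.sql
# demonstration with ordinary files:
mkdir -p /tmp/dbdemo/database_name /tmp/dbdemo/backup
touch /tmp/dbdemo/database_name/table.frm
cp /tmp/dbdemo/database_name /tmp/dbdemo/backup/ 2>/dev/null \
    || echo "plain cp refused the directory"
cp -a /tmp/dbdemo/database_name /tmp/dbdemo/backup/     # this one works
```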

Set up a script to automatically ban IP addresses after 5 failed SSH login attempts, with no timeout to remove banned addresses. I had also blocked my server provider from getting into the box via SSH key. I changed the root password, then logged out of the box, went to sleep, and forgot what I had changed the password to; after 5 attempts I was locked out. I gave my provider what I thought was my root password, and after 5 attempts they were locked out at the NOC.

Then you wouldn’t find anybody to hire (or you’d hire a liar who claimed never to make booboos). Everyone makes mistakes. _Everyone_.

Probably the best one I made in the last while was removing a program that didn’t work on my Mac at home… I typed in “rm -rf /Applications/”, thought that I’d typed the initials of the program and hit tab to complete it, then hit enter. It didn’t match anything of course, so about 3 seconds passed while it deleted programs before I noticed and hit ctrl-c. It got as far as C*, so I was able to re-install the stuff I’d nuked :/

The only ‘mistake’, as you call it, is the one that is not acknowledged. In other words, only fools (and that is putting it far too nicely) who refuse to learn more (i.e. think they are perfect, know everything, are never wrong, and that frankly anyone suggesting otherwise is an idiotic peon!) don’t make mistakes. Yes, thinking about it, I would sure rather be in your position – being ‘perfect’ – because I really like mules and this way I would get a chance to compete against them! /sarcasm

By admitting to mistakes you learn. Without making mistakes you don’t learn much (certainly not as much as you could). If you don’t have anything to learn, why do you even bother? Everyone is below you, so why do you care what they think? (If you suggest you don’t, then why did you write such nonsense?) I’ve yet to meet a leader (which being ahead of everyone would indicate) who actually cares so much as to insult others for admitting they aren’t perfect (or to insult them for no valid reason). I’ve met a lot of followers who do exactly that, though…

It’s not hard. It’s not elusive. It’s not difficult to learn. Editors have their uses. But if you’re also afraid of > or >>, then I guess you’d be afraid of sed and awk, or pipelines, or any command that edits a file. Think of sed -i: one screwup and you could wipe, or make useless, an important file or many files. Yet I guarantee it’s not only FAR faster, it is worth it and among the best solutions for many problems (multi-file search and replace and the like). And as for > and >>, here’s another example of how knowing how they work saved my hide.

Imagine this: a mistake in, or an entirely missing, /etc/fstab. In my case, I think the root filesystem entry was gone, but this was years ago so I could be remembering wrong; I may have even made the mistake myself. Whatever – the important point is, I knew how to fix it. It was on a FreeBSD box, is all I remember. Regardless: you don’t always have editors. One such reason: think of the linked-in libraries an editor needs. Another: a not-yet-mounted file system (possibly the most relevant reason in the case I fixed).

So how did I solve this issue without having to reinstall (remember: no editor!)? Simple:

cat > /etc/fstab, typing the entries by hand and finishing with Ctrl-D (a here-document – cat > /etc/fstab << EOF … EOF – handles multiple lines too).

Yes, I actually reconstructed /etc/fstab by knowing how to use the shell.
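For anyone who hasn’t seen the trick, a sketch (the fstab entries are illustrative, and it writes to /tmp so it’s safe to try):

```shell
# Rebuild a file with no editor available: cat plus a here-document.
cat > /tmp/fstab.rebuilt << 'EOF'
/dev/ada0p2   /       ufs    rw   1  1
/dev/ada0p3   none    swap   sw   0  0
EOF
cat /tmp/fstab.rebuilt      # verify before copying it into place
# (a bare "cat > file", typing lines and finishing with Ctrl-D, works too)
```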

So that’s one reason people insist on using it: it is VERY useful and not really hard to keep straight. It reminds me of the quote that goes along the lines of ‘unix is user friendly, but it’s very particular about who it is friendly with’.

A few years ago when I was still relatively new to linux, I had my windows drive mounted as /windows/c on the system. I wanted to delete my wine windows directory. So I moved to the .wine directory and ran rm -r /windows. I never had the opportunity to abort as I had left the house after running that command.

To this day I’m still grateful I had gotten into the habit of backing up data on a regular basis.

When a s/w installation went wrong, I decided to delete the incomplete installation, which included a /bin directory, then merrily typed y for all files when asked if I wanted to delete them… lost the system /bin, couldn’t do anything. Got an Ubuntu live CD, copied /bin from a sister machine, then manually created all the symbolic links. Found I couldn’t use su anymore, then did chmod u+s su (and other similar files). Recovered all!

You got way lucky on that one. Reminds me of the time I ssh’d into the mail server to work on an Inbox and didn’t exit out. At closing time I found an open terminal on my laptop and typed sudo /sbin/shutdown -h now… it wasn’t my laptop I was shutting down.

On FreeBSD and new to rsync, I set up rsyncd.conf with a new module section named svntrac to allow syncing from the path /usr/local, including two subfolders (one for svn repositories and one for trac repositories). On the client machine, for backup, as root I ran: rsync -avzr --delete server::svntrac /usr/local

It backed up my svn and trac repos fine, but deleted everything else from /usr/local! All installed programs gone, just like that (including rsync!)
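rsync’s dry-run flag exists for exactly this; a rehearsal sketch with throwaway directories:

```shell
# -n (--dry-run) shows what would be transferred and, crucially, what
# --delete would remove, without touching anything. Rehearse, read, then
# drop the -n.
mkdir -p /tmp/rsdemo/src /tmp/rsdemo/dest
touch /tmp/rsdemo/src/wanted /tmp/rsdemo/dest/precious
rsync -avn --delete /tmp/rsdemo/src/ /tmp/rsdemo/dest/  # prints "deleting precious"
ls /tmp/rsdemo/dest/                                    # precious is still there
# rsync -av --delete /tmp/rsdemo/src/ /tmp/rsdemo/dest/ # the real run
```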

I once used rm progname * instead of rm progname*. Just one additional space between the progname and the asterisk. Unfortunately, the system was used by two departments, and each department thought the other one made backups of the system. So nobody did. It took me two weeks to rewrite the code. Since that day I always use rm -i progname* and check and double-check.

Years later, I accidentally put the wrong permissions on /etc/passwd, so no one could log into the system – not even root. Luckily, an IBM guy came along the same day to repair a hard disk. He knew a back door into the system and saved my day.

I was once working on a mounted samba share when I discovered two directories which seemed to have identical content. Let’s call them foo and bar. To make sure I did a diff foo bar on the directories, returning no differences.

In order to free those 5GB of disk space, I continued to rm -rf bar when, after about 10 seconds, it struck me that I was working on a samba share and aborted the operation, luckily.

Next, I logged in using SSH and discovered that bar was only a symbolic link to foo, a fact that is hidden from the user when working on a mounted share :-/. Well… the 1.5GB that were already deleted could be recovered for the most part, but I certainly learned a lesson. Don’t trust samba shares ;-)

One time I was copying some neat UNIX commands, along with the process steps I had documented, from a journal I kept in Lotus Notes – I was going to paste them into a txt file as I moved to a new mail system. Well, I pasted the info into the wrong window, as I had several windows open at the time. One of the commands was a shutdown command from some processing steps preparing for work to be done on a UNIX server… so I shut down a production server. Amazing how fast it shut down; not so amazing how long it took to come back up. Lucky for alerts – I instantly got an email saying the system was down. I said, “what idiot shut down that box?”… oops, that idiot was me. Luckily most folks were gone for the day, so it was not as bad as it could have been.

I love this one most of all because it made me think about the fastest way of recovering the situation: fire up the filesystem tools and toggle some bits? su and then cat the contents of /bin/chmod over another, less important system executable? Something much simpler I’ve missed? Or some reason a rebuild was unavoidable? Please expand, because I know it’ll be useful…

It would be more difficult if you had done this: chmod -R -x /bin. In that case you might be able to rescue it with the current shell and something in /usr/bin, or else use a rescue disk such as Knoppix – not sure…
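One possible bootstrap, assuming perl (or anything else that can call chmod(2)) survived with its execute bit in /usr/bin – demonstrated here on a scratch file rather than the real /bin:

```shell
# A script that has lost its execute bit stands in for /bin/chmod:
f=/tmp/chmod-demo.sh
printf '#!/bin/sh\necho ok\n' > "$f"
chmod 644 "$f"                         # simulate the damage (not executable)
perl -e 'chmod 0755, $ARGV[0]' "$f"    # perl calls chmod(2) directly
"$f"                                   # executable again -- prints: ok
# on the real system: perl -e 'chmod 0755, "/bin/chmod"'
# and then /bin/chmod -R a+x /bin to finish the repair
```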

My personal favorite, which I did fairly recently, was changing my login password and forgetting to change my eCryptfs password before restarting… When I restarted, I could log in with my new password; however, when it tried to pass the new password on, it wouldn’t match the eCryptfs password, and it could not mount my /home directory… sucked!

I then made things worse by logging on to a different account, su-ing to my main account, and trying to change the ecryptfs passphrase to match the new password I picked…

I ended up getting caught in a loop of changing passwords and su-ing into parallel accounts… I eventually gave up on it and formatted/reinstalled… SUCKED!

My best error was the following. I reinstalled Linux and copied my home folder backup from an NTFS partition. While removing the annoying executable flag from all the files with something like chmod -x * -R, I ended up removing the executable flag from the whole of /bin :(

A non-executable /bin/bash is especially troublesome :)) I couldn’t log back in; I couldn’t do anything. Installation again…

Done this a few times – thought I was logged onto a Red Hat box and ran “init 5”. Surprised to see the server start to shut down instead of transitioning to multi-user with graphical login. Then realised it was a Solaris box – runlevel 5 on Solaris is shutdown and power-off… :|

I assume you mean chown. But it’s the same thing with chmod. This is something I don’t understand with cases like this: if you’re in the same directory (as suggested by ./*), then why not just do

chown -R user.user .

? (or equally as you do, user:user)

If you do: chown -R user:user .* then yes you have a problem because it resolves to .. and ../.. and so on (since you enabled recursion). But . does not have the problem because it recurses down, not up.

If it helps, here is how you can test this in a safe way. The trick is that . and .. and .* (and everything else) are expanded by the shell[1], not the utility. So what can mirror this? If you guessed ‘ls’, you’re absolutely right.

ls -R .*

will show what files chown would act on if you specified the same option (as in -R) and the pattern .*

Similar is what I suggested above: ls -R .

will show you how it works.

[1] Not always true. You have escaping and quoting at the command line. But the idea is the same.

Summary: the shell is doing exactly what you tell it to; it’s just that you have to understand its quirks to see this (and there are many quirks indeed).
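A safe way to watch the expansion happen, in a throwaway directory:

```shell
cd "$(mktemp -d)"      # scratch directory
touch .hidden visible
echo .*                # prints: . .. .hidden  -- note ".." is included,
                       # which is why "chown -R user:user .*" climbs into
                       # the parent directory
echo .[!.]*            # a common workaround: matches .hidden but not . or ..
```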

Being in /backup, meaning to write rm -rf www/, I wrote rm -rf /www. Guess what – it was nice… but the worst part was that a second later I saw my mistake and ran the “right” command, deleting the backup _ALSO_. Only then did I realize what I had done…

I had a year’s worth of MySQL backups in the XML format. Unfortunately, I had failed to read the FULL manual and therefore did not know that while MySQL would in fact write to an XML file, the version on the server could not read or import an XML file. Several panicked hours later, I had a working setup of the latest version (from the website – not pretty) and had managed to import the XML files just to re-export them in a format readable by the version we had on the production server. Lesson learned: always read the full manual before trying new or “better” features. We all thought the XML format was great, more portable, etc… Oh, and the newest version (as of May 2010) could not import the files directly – they were too large and in the wrong format – so I had to write a Perl script to do everything in chunks.

This is really pitiful … such terrible work habits and carelessness. Certainly no one is perfect, but all one has to do is be deliberate and look carefully at what they type instead of getting in a mad rush that will set you back hours or days.

One thing I have done is to create a super-prompt on root account:

>>> [515] user@machine 2010-08-03 13:09:30 [515]
>>> /home/user
>>> $

The “>>>” is not part of the prompt of course. This tells you exactly where you are, what command you are in, what time you did your last command.
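In bash, an approximation of such a prompt (the exact escapes are a guess at the format shown above):

```shell
# History number, user@host, timestamp on the first line; working
# directory on the second; prompt character on the third.
PS1='[\!] \u@\h \D{%F %T} [\!]\n\w\n\$ '
```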

Next: when you want to know what files a command will operate on, as with “*”, instead of running “rm *”, do “echo *” to see the file list first. Then backtrack in history and edit the command.

One thing that has bitten me before is getting too fast with history editing, especially in multiple windows. If you use the same history file for every login, you run the risk of thinking you are repeating a command when you are actually repeating a command that was typed in another window and added to the history file. The way around that is to create separate history files for every login in your init scripts. You must manually delete the old history files at some point, but sometimes it is useful to be able to search for a command you used in the past.
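One way to do that in bash (the filename scheme is just an illustration):

```shell
# Give each terminal its own history file, keyed on its tty device, so
# windows never overwrite each other's history. Goes in ~/.bashrc.
export HISTFILE="$HOME/.bash_history.$(tty 2>/dev/null | tr -c '[:alnum:]\n' '_')"
# shopt -s histappend    # (bash) append on exit instead of truncating
# stale ~/.bash_history.* files have to be pruned by hand occasionally
```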

All the rest of it is a question of being here now – not going so fast that you do not have time to think and see what you are doing. Computers are unforgiving.

Not a UNIX command, but thought I’d share mine. I was RDC’d into a remote box, and after having updated some info on its DHCP server, I wanted the machine to pull down the new changes. Instead of typing ipconfig /renew, I typed “ipconfig /release”.

Needless to say, I had to go to the remote location and type ipconfig /renew myself =(

I’m not sure why I never responded to this, but I have to admit it is the funniest of them all, and I thought so the first time I saw it. It is absolutely hysterical. Of course, for me it is more so: being a programmer and also being really good at debugging (with or without a debugger – indeed both), I know why you would want core files removed (besides the fact that a core file holds the memory and call stack of the program at the time of creation, which includes potential security risks, there is the issue of size…).

I managed to get physically assaulted (kidding, it was just a pretty hard slap on the back of the head), joked at for 6 months, and got a to-this-day epic dressing-down from the boss in front of all my coworkers, for a very simple and idiotic mistake.

Had an SSH terminal open to our DNS/DHCP server, which also doubled as a mail server. I opened another tab inside the terminal window, to another server I was going to install (it was supposed to be the new mail server). As a joke, I called a coworker over and said “watch this!”. To his horror, I proceeded to rm -rf / in front of him, apparently on the production server. (You can see where this is going.) He went white. I laughed and said “heheh, gotcha. That was the old server. It’s now all ready to reinstall!”. He goes, “Nooo, it was the production server, are you MAD?” while turning from white to red. After a couple of rounds of “no it wasn’t; yes it was”, I looked back at the screen.

Sure enough… I had typed the rm -rf into the wrong tab.

Icing on the cake? The work order was to install a backup system on the DNS/DHCP server and migrate the mail server to a new machine. Obviously there were no backups. Punishment: having all my colleagues leave early for the day, with instructions to me of “you will get out of here when everything is back up!”. It was a loooong weekend.

According to the boss, I wasn’t fired on the spot only because “well, at least we know YOU will never ever do another rm -rf without thinking twice or thrice…”.

Sorry to pick on you, and I don’t mean to, but this is one really big source of mistakes on computers: people getting emotionally involved. You were thinking more about other people and making a joke than about what you were doing, so you were and will be bound to make mistakes like this. Like playing around and not realizing which window has focus, or that your ssh session has timed out and you are now back on the original machine, or whatever.

With computers you have to think very carefully about what you are doing, then look at it, and even test it if you can, before you run a command that does something complicated, or any kind of “write” operation. You also have to think about the ease of recovering the data should something go wrong. Before we did upgrades on users’ machines, we used to just naturally assume users were lying about having local data on their machines, so we would do a backup image to an admin server just in case. That saved a ton of data from people who sometimes did not really understand the difference, or were not totally thinking.

I used to get kidded about my seriousness and the fact that when we did common operations, upgrades, etc., I would look at a command line and ask everyone there if it was OK. I got neverending shit about that, until a few times when people did not do that and we lost customer data.

We were doing an upgrade once and one hot-shot admin was ready to hit the return key to start it. I asked him if there were backups and he said he did not know, but nothing was going to go wrong. I told him to make sure and back up the machine, and of course, you know what happened, because he got his emotions, his arrogance, involved in it and could not stand to be questioned.

When I am hiring I try to look as best I can for this trait, because it is the number one problem with a good admin … that and just plain crookedness or dishonesty. Why are there so many anti-social sysadmins? ;-)

No offense, Brux, but isn’t this for people sharing their own mistakes? He clearly knows he screwed up and how, or he wouldn’t be here. We already know you’re good because you’re on a site that’s about improving your craft, but do you have a mistake of your own you can share for our education and entertainment? Mine is at #228, fyi.

by the way … i really love the individual icons that this site creates for its commenters … can someone please, please, please, email me and tell me what that is … it’s really cool and I’d like to use something like that myself …. please!!!! very very cool!

One time, while I was somewhat mentally ill (no excuse really!), I was using my father’s computer with Mac OS X. I had created some files at various places on the system, and I wanted to remove all the files I had created. So I thought, well, I can just run rm -rf /; that should delete just the files which I have permission to delete, and fail for the others.

Not such a good idea, since he had an old filesystem mounted 777 without proper owners or permissions! Fortunately I stopped the rm process before it trashed everything. Luckily he did have backups of those items early in the alphabet, and didn’t lose anything important.

I wish more people would admit to having issues, when they are using Apple products….Still, I don’t think ‘mentally ill’ is what you’re after …. But at least you admit you have a problem using Apple products!

Now if you truly do have mental health problems, that is something else and something I by no means am dismissing (I’d tell you why but you’d never believe me…). I can see humour in everything and that is why I went that direction.

As an aside, why didn’t he have it 1777? And to that end, I would argue that while you did something risky, he was equally guilty for having his files world-rwx…
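The sticky bit the commenter is referring to, sketched on a throwaway directory: mode 1777 (as on /tmp) lets anyone create files in a shared directory, while only a file's owner (or root) may delete or rename them.

```shell
# Create a world-writable directory WITH the sticky bit:
mkdir -p /tmp/shared_demo
chmod 1777 /tmp/shared_demo
ls -ld /tmp/shared_demo    # mode shows as drwxrwxrwt -- note the trailing 't'
```

Without the leading 1, any user could delete any other user's files in the directory.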

I had booted a fully working Windows XP box with my rescue USB stick, to show off to my friend, showing him various command-line stuff.

Then – as a regular user – I typed: $ dd if=/dev/urandom of=/dev/sda

Telling him that regular users don’t have permission to alter the hard drive directly, I hit return.

…

Uh-oh. No permission denied message. Hitting Ctrl+C, I realise my friend has just lost his partition table and Windows installation. Luckily, the data partition was separate and recoverable by TestDisk.

I had added my user to the ‘disk’ group months ago without realising it.

To add: I once wanted to deliberately delete an old 486-based Debian install by typing rm -rf /etc (and so on) while the system was running, and I was surprised by how resilient it was and how long it kept itself alive.

Maybe it’s better to avoid the mistake than to commit it and have to learn from it.

Got a new hard drive and moved the contents of the / and /home partitions from the old one to the new one. In order to boot from it, I figured I’d clone the MBR so I wouldn’t need to go through the hassle of setting up grub. And there I went:

dd if=/dev/sdb of=/dev/sda bs=512 count=1

Note the “512” instead of “446”. Those 66 extra bytes held the old disk’s partition table, and it was written over the new disk’s. Of course I hadn’t backed up what I was about to overwrite. Now I’m doing all the copying again and adding “get a LiveCD which supports ext4” to my to-do list.
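The 446-byte variant copies only the boot code and leaves the 64-byte partition table (bytes 446–509) and boot signature alone. A sketch on image files standing in for real disks (substitute /dev/sdX at your own risk); conv=notrunc stops dd from truncating the target:

```shell
# Two fake 512-byte "MBRs":
dd if=/dev/zero    of=/tmp/old.img bs=512 count=1 2>/dev/null
dd if=/dev/urandom of=/tmp/new.img bs=512 count=1 2>/dev/null
cp /tmp/new.img /tmp/new.before.img   # keep a copy to verify against

# Copy ONLY the 446 bytes of boot code, preserving the partition table:
dd if=/tmp/old.img of=/tmp/new.img bs=446 count=1 conv=notrunc 2>/dev/null

cmp -n 446 /tmp/old.img /tmp/new.img && echo "boot code copied, table intact"
```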

First n00b error, which ended up uncovering a weird bug in RHEL: I typed reboot on the wrong server. As soon as I realized the error, I typed init 5. runlevel showed “6 5”. This locked the computer up totally: couldn’t reboot it anymore, couldn’t use it, etc. Deadlock. I had to physically reboot the machine.

The second error was a better one. I was working offsite and decided to do some modifications on /etc/passwd using sed. So I made a backup: mv /etc/passwd{,.old}. Bad idea: nobody could log in anymore, root included. We had to boot single-user and restore the file.
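The safer move: cp leaves the live file in place while you work on the copy, whereas mv yanks it out from under every login, root included. Sketched here on a scratch copy of /etc/passwd (on a real box, vipw is better still, since it locks the file and sanity-checks the edit):

```shell
# Stand-in for the live file, so the demo needs no root:
cp /etc/passwd /tmp/passwd.work

# Safe backup before editing -- the "live" file stays in place:
cp -p /tmp/passwd.work /tmp/passwd.work.old

# What bit the commenter (DON'T):  mv /etc/passwd /etc/passwd.old
grep -q ':' /tmp/passwd.work    # the live file is still there and usable
```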

Nowadays I have backups of the more important things: offsite every week, and offline every now and then. For work material, like source files and databases, on a daily basis. It has saved my ass a couple of times :)

Once I copied some text from a technical deployment website to paste into a Word document. Then I worked on something else and (I thought) copied something else. When I pasted into my unix session, everything from the deployment page got pasted there… everything was an invalid UNIX command except one line….. \rm ~

The only system-crashing thing I have done was to install busybox in what was supposed to become an initramfs, but forgetting to chroot. Nothing of value was lost, though.

Also, I once took a backup of an MBR on the same disk it belonged to, before wiping the old one (I don’t clearly remember why, but I think it had something to do with dual-booting Windows). Not being one to panic, I opened my own MBR in a hex editor, found some common patterns, booted up the wasted computer with a live CD, and made a small C program that searched through the disk for something matching a boot sector, finding a single one some 80 gigabytes out. The whole room applauded (… I wish).

AIX root login by default has / as its home directory. So: 1. logged in as root 2. cp -pr /tmp/root_home /root 3. cd 4. rm -rf * … duh! Now the first thing I do on a fresh AIX install is create /root and make it root’s home directory (via a post-install script). Lots of classics above. Fun, but painful when it hits you. Thanks!

I had set up a chroot to test something and mounted proc under this directory. After testing, I did an rm -rf testdir. Wondering why the rm took so long, I saw the mounted proc :( I had to restore the /home directory from backups.
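A guard worth scripting for this one: refuse to rm -rf a chroot tree while anything is still mounted under it. The path /tmp/testchroot is a stand-in, and the /proc/mounts check is a simple substring match, a sketch rather than a bulletproof test:

```shell
CHROOT=/tmp/testchroot
mkdir -p "$CHROOT/proc"

# Only remove the tree if no mount point lives under it:
if grep -q " $CHROOT" /proc/mounts 2>/dev/null; then
    echo "something is still mounted under $CHROOT; umount first" >&2
else
    rm -rf "$CHROOT"
fi
```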

Kindly be very careful when you work as the root user. My recent mistake on a Linux box was as below. I wanted to check the reboot history for the system, and the blunder I made while issuing the command was:

The correct command is:

last reboot

The command I entered:

last | reboot

Hence, the system got rebooted. It was a production system, and it became a big escalation. Be cautious when issuing commands while working as the root user.

It’s habits, lack of experience, ignorance, and showing off that hurt. Interfaces are important, but secondary to all of that (reboot could check whether stdin is a tty, but then again an ancient one wouldn’t, and one might be less careful expecting it to be somewhat fingerproof).

I set up a Linux machine without a monitor to react to Ctrl-Alt-Del by powering off, via /etc/inittab: ca::ctrlaltdel:/sbin/shutdown -h -t 4 now

Unfortunately I had TWO keyboards for TWO servers and no KVM switch. Guess what: I wanted to log on to a Windows 2000 Server, pressed Ctrl-Alt-Del, and heard the beep from the Linux machine, just while it was compiling a newly set up kernel.

Locked out due to firewall reconfiguration. Done that. Quite a few times. I never did like firewalls, and firewalls don’t like me either.

rm -fr on the wrong path. Not really, but there was a case on an old Netra machine where the filesystem was corrupted, and when I issued ls in /tmp I could see the whole / filesystem underneath it. “Just some crazy inode mix-up resulting in ghost entries,” I thought. Luckily there was a recent backup around :)

And the most embarrassing mistake thus far: I was once at a telco just before it went online, and one of the Sun clusters there had some strange network issues. “The problem lies in the ARP cache,” one network guy suggested. After clearing some ARP entries to no avail, I decided to clear everything from the ARP cache! There went the cluster interconnect, and I was experiencing my first cluster split-brain. One node immediately panicked, while the other one was… well, not at its best. I rebooted the whole cluster. Fortunately it only took about 30 minutes and resolved that network issue, which to this day remains a mystery.

At 2am I typed init 0 instead of init 6, doing a scheduled upgrade including the kernel. No one in the building had access to the server room. Had to drive 90 minutes into work simply to push the power button…

In my first sysadmin job I did not quite grasp the concept of a dumb terminal… I wanted to reset the terminal, so I typed reboot while logged in as root… The terminal did not reset, but the server did…

My worst one was while I was working on the build/compile system of a program. The ./bin/ directory was filled with files, but I wanted to see what files the build system itself copied there. But instead of rm -rf ./bin/* I typed rm -rf /bin/*. And as my luck would have it, I’d been doing some work requiring root earlier and had forgotten to exit. I had to reinstall the operating system, but at least I was able to back up all files and settings.

Another good one was su -c “passwd” and pasting the password. I don’t know what I pasted, but it sure wasn’t the password I had in mind. Other than that I’ve done the “sudo ifconfig eth0 down” and the reboot on the wrong box. Once I wanted to play a prank on my friend and sent him a “:(){ :|:& };:”. I shouldn’t have done that; it ended up causing him a lot of problems.

There’s a very good reason this was not a stupid mistake, but when I told it to my dog, he got up and went into another room so it clearly is not very convincing…

OK, it could have been worse – but since it should be recoverable, I thought I’d post the fix:

I remember thinking it was odd that with such a long list of mistakes, no-one had ever posted the quickest fix for their problem – I may be about to find out…

OK, so I’ve never put anything in /usr/include manually (I’d use /usr/local/include for that) so rpm or yum should be able to rescue me (dpkg or apt for Debian).

I’m assuming that only packages with ‘-devel’ in the name will deposit files in /usr/include. (-dev for debian)

I’m in a rush so I: rpm -qa | grep devel | cut -d '-' -f 1,2,3

then mess around until I have a list of names ending in -devel which yum will accept. I then create a new xen instance of the same linux flavor. Making sure my hosed xen dev instance and the new xen instance are updated to today’s release, I issue ‘yum install (package)’ for each of the identified packages on the new ‘box’.

I then cross my fingers and scp everything from the new instance’s /usr/include to the hosed box.
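The list-building step of that recovery, sketched on canned `rpm -qa` output so the text munging is visible; the package names are just examples, and on a real box you would pipe `rpm -qa` in directly:

```shell
# Strip version-release suffixes from -devel package names:
printf '%s\n' glibc-devel-2.12-1.107.el6 zlib-devel-1.2.3-29.el6 bash-4.1.2-15.el6 \
  | grep devel \
  | sed 's/-[0-9].*$//' > /tmp/devel-pkgs.txt

cat /tmp/devel-pkgs.txt
# then, on the freshly installed donor box:
#   xargs -a /tmp/devel-pkgs.txt yum -y install
```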

A very common mishap (one of my power-users did this – it took about 3 hours of my time to fix):

ftp the /etc/passwd file from a Unix box to a Windows box so it can be edited. Edit the /etc/passwd file in Notepad or WordPad (which like to add line breaks and other formatting). Save it and ftp the file back to the Unix box. Voila: no one can log into the box and no one can su to root.

There is no way to fix this but to bring down the box, go into maintenance mode, and edit the /etc/passwd file manually (remove the carriage-return characters from the end of each line of the file).
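In maintenance mode, the stray carriage returns can be stripped in one pass rather than edited out line by line. Sketched on a scratch file standing in for /etc/passwd:

```shell
# Fake a DOS-mangled passwd file (lines end in \r\n):
printf 'root:x:0:0:root:/root:/bin/sh\r\ndaemon:x:1:1::/:/bin/false\r\n' > /tmp/passwd.dos

# Strip every carriage return in one go:
tr -d '\r' < /tmp/passwd.dos > /tmp/passwd.fixed
```

dos2unix does the same job where it is installed.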

(The last line was me copying some custom config scripts; I usually prefix them with the server name so I can find them quickly. The last line has the added bonus of removing files I’m still working on in my ~ . *sigh*)

I was working at a company which had a home made ticket solution, the database used mysql4 (MyISAM, AUTOCOMMIT=1) and a flaw in the design, a column which just included integers was created as varchar(10) and I was to update a phone number for a customer in the database:

UPDATE table SET phone='1234567' WHERE ticketid = 123;

This resulted in all customers getting the same phone number. It wasn’t trivial to get the phone numbers back; we managed to recover most of them from a backup and the rest from Apache logs.

I went to back up passwd, group and shadow because we were setting up a new server and needed to copy over the users, and when I ran the tar command I did this: tar -cf /etc/passwd /etc/group /etc/shadow user_backup.tar

Luckily passwd had been backed up into passwd-. I phoned the datacenter and got somebody to boot to single-user mode and replace the file….
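The gotcha there: tar's -f takes the ARCHIVE name as its very next argument, and everything after that goes into the archive, so with the archive name last, /etc/passwd itself got clobbered. Demonstrated on scratch files standing in for the real ones:

```shell
mkdir -p /tmp/tardemo && cd /tmp/tardemo
echo 'root:x:0:0' > passwd
echo 'root:x:0'   > group

# What bit the commenter (archive name last) would have overwritten passwd:
#   tar -cf passwd group user_backup.tar

# Correct: archive name immediately after -f, members after it:
tar -cf user_backup.tar passwd group
tar -tf user_backup.tar     # lists the archive members
```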

I keep several clients’ data on my testing server, with testing data in /aaadata. I keep customer1 data in /cu1data and customer2 in /cu2data etc., so it’s easy to mv /aaadata /cu1data and mv /cu2data /aaadata when I want to swap test sets of data.

One day I got confused while remotely logged onto customer1’s live server… did the mv /aaadata /cu1data and then the mv /cu2data /aaadata, which (thankfully) did not work.

Luckily I quickly figured out where I was, did a mv /cu1data /aaadata to put it all back, kicked everyone off, rebooted, and got away with it! Close one! (I still df -v before these moves – a little paranoia is good for you.)

A lot of your mistakes are pretty tame compared to some I’ve seen in the wild and some I’ve committed myself.

When I was really young (this was my first Unix machine, a Sparc 2; I was maybe fifteen or sixteen…), I didn’t understand Unix permissions and was very frustrated by them. I figured that the easiest way to make sure I could access everything I needed was to say

chmod -R a+rwx /

I was root, on what had to be Solaris 2.4 or 2.5, and let the command run for a while before thinking, hm, maybe this is a bad idea. The original “logic” was, “well, since I’m the only person using this machine, why shouldn’t I have permissions to read and write everything?” I completely failed to understand what the execute bit did, for starters.

Permissions were so hosed there was no option left but to reinstall the machine. Learned that lesson but good.

Using *the* simplest backup command one knows *is* key, especially when already tired (tiredness is the antagonist of being responsible).

I’ve hosed a production backup server’s disks (what an irony) quite recently because:
– we installed on a single HDD of two, as they still awaited a hardware RAID controller
– I had made a mess half a year ago trying to “at least make a backup” (dd’ing sda to sdb)
– I did pay attention to replacing UUID-based mounts with device-based ones
– I did look into /etc/fstab and `mount` before proceeding with dd (all clear)
– I did *not* look into /proc/mounts (where the rootfs was UUID-mounted from sdb)
– I did *not* perform the most basic off-host backup I easily could have, with rsync

What was still on the positive side: the tapes weren’t harmed by that dd (it finally struck me 20 gigs in), there was a somewhat older snapshot of /etc/bacula, and while ls would already segfault (had to use echo *), some other tools still worked off the damaged rootfs, so some more parts were salvaged. But it was a reinstall and a waste of time, even if it didn’t hit users heavily.

And yes, I did set up off-host and off-site backups after reinstalling on software RAID. A working, remotely reachable iKVM helped immensely too.

A habit of running a primitive local snapshot after considerable changes still holds: # tar zcf /root/BAK/etc-`hostname -s`-`date +%Y%m%d`.tar.gz /etc

> dd if=/dev/urandom of=/dev/withManyData count=1024 bs=1024
> i forget it…. the last command, I have no way to roll back, only reinstalling

Not exactly: DO NOT REBOOT while you still have the kernel which mounted the filesystems while all the information to do that was available.

STEP AWAY from console, have some tea, sit down and try to calm yourself.

Then estimate the consequences of losing the info and if it’s valuable, consider the possibilities to save it.

A second HDD is OK if it’s already mounted; otherwise consider a networked backup (rsync/scp). A USB HDD or flash drive might not work anymore if you had the luck of damaging the kernel module files that would be needed to use them.

Then back up /etc, /home, /var or whatever might still be readable.

Then have some more tea — unless the downtime is really pressing on you or you have 99% solid backups. Maybe you’ll remember some more stuff.

Only then say goodbye to that filesystem and reboot; afterwards only salvaging tools might be able to help (testdisk is easy to use but limited; gpart is a PITA but did help me recover partitions after installing a distro on the workstation disk instead of the test one; photorec reportedly helps with salvaging files from a damaged filesystem).
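A sketch of the salvage step described above: copy whatever is still readable somewhere safe before rebooting. A remote target (e.g. `rsync -a /etc host:/salvage/`) is best; a local directory stands in here, and the file list is just an example:

```shell
SAFE=/tmp/salvage
mkdir -p "$SAFE"

# Grab whatever is still readable; skip anything that isn't:
for f in /etc/hosts /etc/passwd; do
    [ -r "$f" ] && cp -p "$f" "$SAFE"/ || true
done

ls "$SAFE"
```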

Aside from firewall/sudo lockouts, chown -R smth .* (back when it would climb from /home/smth up to /home, which was a minor disaster), and the aforementioned “dd’d the backup server down” case (BTW I did ask a colleague to advise and witness me, then took a cup of tea, then mailed those concerned within half an hour when downtime and a reinstall were considered inevitable), I can recall these…
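Why that `chown -R smth .*` climbed out of the home directory: the `.*` glob matches `..` too. Safe to see with echo in a scratch directory:

```shell
mkdir -p /tmp/dotdemo/sub && cd /tmp/dotdemo/sub
touch .hidden

echo .*     # in most shells this prints: . .. .hidden  -- note the '..'!

# find hits only the real dotfiles, never the parent:
find . -maxdepth 1 -name '.*' ! -name '.'
```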

One of my first Linux systems (a RHL5.x, libc5) was hosed by very much wanting to upgrade xmms-0.7 to xmms-0.9 from a “pirate” “Red Hat Linux 6.01” (actually 5.9, a 6.0 beta) CD where it was linked against glibc2. Well, rpm did try to stop me. And I thought to cheat it around. After rpm -Uvh --force glibc*rpm finished what it was ordered, even ls would blow up. It would be only years later that I would know how to recover from that situation (like, boot the 5.x installer (there were no livecds yet, it seems), mount that filesystem, copy glibc rpms there, rpm2cpio | cpio -id at least /lib/libc.so and VIP friends, and then try booting off that root to rpm -Uvh --oldpackage that glibc*rpm; or keep rpm-static at hand in the first place). Well, at least glibc2 is actually nice regarding backwards compatibility.

Several years later there was a Very Important Dump sitting in /tmp, with stmpclean set up to shoot down month-old crap there; the next day was a moment to remember. It was my fault for placing it there in the first place, but I also had to talk with the folks who had set it up that way in the packages for the distribution we installed.

On the same job, we once had to take a development environment into production real fast, and chose to continue running the backend on an office server where it was already deployed. Then we had a tough time moving it to the production servers at colocation, and then summer came with increased power consumption for downtown’s air-conditioning systems. One day we were running on all the office UPSes, with two-minute downtimes to change them (no second PSU to juggle cables, and we didn’t chain up UPSes for a reason I don’t remember anymore)… The decision to move production services to a production environment became a bit more obvious ;-)

On a community-built LUG server, I was relying on a donated DAC960 controller to handle a separate SCSI HDD holding the root filesystem for an FTP server for some time, until it began to sneeze on a virtually new 18G drive and drop a single physical volume from its single logical volume… Since the virtual environments (linux-vserver back in 2004, openvz by now) lived on ATA/SATA drives (there was a mix back then), I ended up with a perfectly working but unmanageable server until that half-a-meter board was retired to a stand. I thought it’d be cool to run IDE+SATA+SCSI; in fact it wasn’t, and “coolness” is a bad factor to account for.

Haven’t rebooted/halted the wrong system so far; a friend of mine did, and he told us the short story to remember (thanks Nick!). Haven’t been caught by crontab -e/-r so far either, nor by prepending a / where it wasn’t meant to be. Well, lots of wisdom read here, thanks.

Bonus tracks:

A colleague was experimenting with FreeBSD 4.x softraid (vinum) on a production server with some 4 HDDs and found out the hard way that after hitting some pretty low limit on volumes, the kernel just froze up. He honestly didn’t expect that, but we were down for that night.

Another anecdote was of a junior who decided to test how the hardware RAID5 worked, pulled a drive from a production server, then pushed it back and pulled another one. That very moment the array was ruined. The poor guy apparently didn’t understand that it takes time to rebuild an already-degraded array after pushing the first drive back, and that he was unnecessarily risking a double fault even if he hadn’t pulled the second drive (if one of the remaining spindles were close to failure, it might not bear the added load of the rebuild and go down, bringing the whole array with it, again).

PS: yes, I did notice the difference between the stories where there was a backup (“phew!”) and where there wasn’t. Taking care to review and test backups, *especially* after reworking the storage scheme (e.g. Bacula won’t descend into mounted filesystems by default, and moving data to a separate filesystem might also prevent it from being backed up), is really worth the trouble. Backups are a sysadmin’s children: he tends for them, then they tend for him…

Well, I have used Ubuntu for a few years, and mine (among many) was: I was trying to add a repository for my anti-virus and I added the repository for the whole Debian volatile project. It was a huge update, and then nothing when I restarted.

My most remarkable command line mistake happened sometime in the eighties. I didn’t know there was this thing called “Usenet”. Trying to delete some file, I accidentally typed rn instead of rm. That mistake cost me countless hours over the next ten years or so.

I was going to wipe a little USB stick and did dd if=/dev/urandom of=/dev/sde on the production server. Oops!

Lesson learned: Use very obvious, colorful, different prompts for each system. At the time I had the same .bashrc on multiple systems.

Another mistake I made was to add “exit 0” in some function in my .bashrc, when I really wanted a “return 0”. It took me some time to realize why I was instantly kicked off every time I logged in via SSH. Haha!

It wasn’t bad service-wise; it was a WebLogic instance running in a massive cluster of about 10 instances. The members just dropped it from the cluster when the hostname changed to “-f”. I noticed immediately… naturally, because hostname returned nothing. It was a quick puzzled feeling, then one of those “oh crap” feelings =). Luckily that client (no names) had the shell prompt configured with the hostname, and I could easily see what I needed to change it back to by looking at the previous commands I had run.

Embarrassment-wise: superb. About 20 minutes after I had “cleaned up the issue”, I thought it had just looked like a blip and no one had noticed. Which in general was true; the client hadn’t noticed it at all. However, a senior of mine, and somewhat of a mentor, walked over to me, pulled over (rolled over) an office chair, sat down, looked left, looked right, made sure we weren’t on attention lane, and said in a strangely nice yet taunting way: “so… (long pause) don’t always run commands as root… (another long pause) … and don’t do hostname -f on Solaris.” He then got up and walked off back to his office.

LMAO,,, _fail_…

What did i learn?

Of course one would think the point is: don’t run hostname -f on Solaris. But what I actually learned from this, and from the years since, is: don’t abuse root or superuser access. Take advantage of proper permission usage.

My best mistake while learning Linux at work was accidentally deleting a config file in sites-enabled using WinSCP! Straight away the client’s website was unavailable and I had to put my thinking cap to good use. So I managed to copy the contents of another file and reconfigure. It was a tough learning stage; the power of root should never be underestimated!

Hi all, so funny! My vote goes to typing in the wrong console (I rebooted the main Oracle DB production server once… all the operators were coming to tell me there was a problem with the DB… oops).

My second favorite: the rm -fr /stuff * (with the space before the *)… Even being careful after the first time I made it, about 8 years ago… I managed to do it 3 more times… in a production environment! You sure are happy to have some backups then!

Got into work one morning.. Logged into the main production server (at another site), and found a new directory:

/aaaaaaaaaaaarghhh_dont_delete_me

Called the sysadmin there asking about it, and had the reply:

“Oh, you found it then!”

Apparently, he’d been removing a user from the system the previous evening and did a:

rm -rf / home/username (yes, with a space between / and home). Fortunately, since we were using Amanda, there was an /amanda directory full of backup-related files, which gave him a few seconds of Ctrl-C time :-)

Hence the new /aaaaaaaaaaaarghhh_dont_delete_me directory, full of small, random files, just to provide a few more seconds of Ctrl-C time should anyone repeat the command!

Oh, I also did a ‘rm -rf /’ (knowingly!) on an old SunOS4 server that was being shipped offsite; we were rather disappointed to find that it thrashed its disks for half an hour before just returning to the ‘#’ prompt. I think the only useful thing we could do with it afterwards was ‘echo *’ :-)

That just happened to me last week! Granted, it was not really my fault ;-)

At the staging area (where we prepare boxes) we have no KVM switch. Due to a stack of new hard disks, the keyboards had to be placed one in front of another (instead of right in front of each monitor). I had earlier moved the Windows box’s keyboard to the front. Without my knowledge, while I was out my colleague switched the keyboards to do some Linux stuff. I went back and tried to log on to Windows. A yell from behind me confirmed that I had done something terrible…

… Luckily it was still being installed, so nothing important was gone. The inconvenience of having to reinstall was promptly forgotten when I treated her to a nice dinner ;-)

A KVM switch is a dangerous thing. I had 2 x Linux and 2 x Windows connected to one KVM. The screen was blank. I just hit CTRL+ALT+DEL because I wanted to log in to a Windows machine… … which caused an immediate reboot of the main Linux router. Fortunately, it took only 2 minutes to boot up.

Forgetting sudo on the first run has kept me from doing some really stupid stuff, like removing my /etc folder or similar. Damn was I glad to see that “Permission denied” after rereading the command and noticing my (almost) fatal mistake.

This reminds me of when I told a friend a way to auto-log-out on login (there are many ways, but this one would be more obscure). He then told someone who was “annoying” him to try it in his shell. The end result was that this person was furious. Quite so. And although I don’t find it so funny now (keyword: not as – I still think it’s amusing), I found it hilarious then (hey, I was young and as obnoxious as can be!).

The command, for what its worth :

echo 'PS1=`kill -9 0`' >> ~/.bash_profile

Yes, that sets the prompt to run the command kill -9 0 upon sourcing of ~/.bash_profile, which kills that shell. Bad idea!

I don’t even remember what inspired me to think of that command as this was years and years ago. However, it does bring up an important point :

A word to the wise: if you do not know what a command does, don’t run it! Amazing how many fail that one…


I love that one too. But I wonder, if you’re looking for a certain list, would it not be OK to use something like cscope or some such (I don’t know what language you were using, but the point is the same)? Of course, I actually use grep a lot when programming, so I could see that being why you did it this way (though for me it isn’t so much compiling a list as finding where things are; in that case cscope might be more useful, but only if I have one shell open and don’t want to rely on, say, multi-tabbed vim or screen. Still, it’s similar to compiling a list, so I’m not suggesting anything, and it is beside the point).

On this mistake: it can of course happen with other commands. E.g., try sed without -i on a file while redirecting the output to that same file, to (try to) update the file in place. Same with other commands. Either way, the result is that you might not be too happy, indeed, unless you have a backup. Of course, if it is a source tree, one might have more luck, at least if it’s a project of much worth (and they haven’t made many working-copy revisions), thanks to revision control. But yes, agreed: this mistake is quite fun. I think I’ve done it, but not on a file that was important and never one that I couldn’t restore. Still, those who are afraid of > and >> – using them or otherwise – are basically afraid of learning, and in the end will not be as efficient as they could be.
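The sed-without-`-i` pitfall mentioned above, demonstrated on a scratch file: the shell opens (and truncates) the redirect target before sed ever reads it, so the command empties the file instead of editing it.

```shell
echo 'hello world' > /tmp/seddemo.txt

# The trap: the shell truncates the target first, sed reads an empty file:
sed 's/hello/goodbye/' /tmp/seddemo.txt > /tmp/seddemo.txt
wc -c < /tmp/seddemo.txt        # the file is now 0 bytes

# Redo it properly, in place with GNU sed's -i (or via a temp file):
echo 'hello world' > /tmp/seddemo.txt
sed -i 's/hello/goodbye/' /tmp/seddemo.txt
cat /tmp/seddemo.txt
```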

(Yes, I’m going through some posts today out of boredom and I’m going to offer some mistakes I recently made, after that, so for those subscribed to this post, I am sorry for the several messages).

I don’t see how the commands would do any harm, unless there are some versions of chown and chmod that don’t require a file when specifying recursion (as -R). chmod -R 400 or whatever permissions (and the same without -R) by itself, without any file (or files), will just give an error. Is there actually a version that is so naive as to assume you mean the current working directory (or was this a typo on your part; perhaps you meant ‘.’ at the end of the command)? (One can hope that if there is, it is very careful about the parent directory…)

Hey, now that was a tip to get into this list with yet another story ;-)

/home holds home directories, but usually isn’t any user’s home directory itself. Tampering with its permissions is unlikely to go down well with the other users.

2 kalyani: do “cd; ls” — plain “cd” is equivalent to “cd ~” or “cd $HOME”. And this isn’t the place to ask (or give) such advice, as I was already kindly reminded — head over to linuxquestions.org or whatever forum suits you better. See also e.g. http://www.tuxfiles.org/linuxhelp/cli.html (googled up as “linux command prompt howto”).

PS: anti-offtopic: my latest colocation visit (after an IP-KVM session but with no remote boot media available) was due to an hda->sda issue (2.6.18 would still use the legacy IDE driver for an IDE CF root device, while 2.6.32 went libata); I did it “the smart way” and replaced the “boot=/dev/hda” value with whatever UUID “blkid /dev/sda1” had returned earlier while modifying /etc/fstab for the old-or-new kernel setup. The thing is, one doesn’t really want a bootloader in a _partition_ when the MBR isn’t prepared to boot off that partition. And I managed to get that MBR LILO to the L 99 99 state, where the backout entry wouldn’t be of any use…

Best one yet: I had set up an ipfire firewall at my CEGEP’s computer club and found out that my DHCP server would end its leases every 2 hours. I then decided to set the max lease time to 0, thinking that would remove any limit. It didn’t take long to find out that it was issuing new IP addresses to the computers on the local network every second, rendering it inaccessible. I had to go directly to the machine to change it.

The worst thing I ever did caused a disaster. There was a single SATA disk in a server for a few users, which I thought was a professional SCSI drive (same device name: sda; 1st mistake). It held Samba with roaming profiles on a reiserfs partition. One day a user asked if it was possible to recover a deleted file. The answer was: maybe. I found a command to rebuild the reiserfs file structure, which can help recover data, something like this: reiserfsck --rebuild-tree -S -l /root/recovery.log /dev/hda3, where /dev/hda3 was /home. I unmounted /home and ran it. Unfortunately, there was a BAD SECTOR on the drive (I didn’t know that) which caused the command to abort. All the data was gone, including all the roaming profile data from the workstations the next day. No recovery possible. No backup.

Backup and RAID are orthogonal: one will lose data added after the last backup if the only disk (or degraded RAID) goes down or a filesystem gets seriously corrupted.

BTW, another similar fault of mine was messing with a damaged filesystem on its native drive (which had developed read troubles) rather than on a _copy of_ a copy of that block device… that sort of extra duplication very much pays off when you badly wish to go two minutes back in time, to when the data was still relatively close at hand.
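The principle is easy to rehearse with a plain file standing in for the block device (all names here are illustrative):

```shell
# Image the ailing device once, then run recovery tools against the copy:
dd if=/dev/zero of=disk.img bs=1M count=4 2>/dev/null   # stand-in "device"
cp disk.img disk.work.img                               # work on the copy...
# ...run reiserfsck/testdisk/etc. against disk.work.img here...
cmp -s disk.img disk.work.img && echo "original untouched"
```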

And it wasn’t until another colleague asked me to restore some file from backup that I’ve realised that git backups effectively stopped the day we did the move (something like a month ago back then).

Bacula honestly warned in reports that it didn’t descend in a mounted filesystem, and docs stated clearly that one should explicitly either specify such filesystems, or tell it to cross mount points (and care for /proc and friends on his own).

Mind you, it was proper hardware, mirrored disks, scheduled downtime, an extra copy taken just in case — but yet another gotcha and only myself to blame.

Lesson learned: don’t only backup, do verify what is extractable. Especially after storage-related changes.

Quite a long time ago, I discovered the magic of /etc/inittab and its “default” line. I changed it to ‘s’ for “single user mode by default” and did what I had to do, then “reverted” to normal mode by typing ‘m’ for “multi-user mode”, and rebooted the server. Since no line matched this strange runlevel ‘m’, it started NO tty at all. Fortunately, I had a similar server 10 miles away where I could make a bootable tape…
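For anyone who never met SysV init, the line in question looks something like this (runlevel numbers vary by distro; ‘s’/‘S’ is single-user, and anything init doesn’t recognize gets you a console-less boot):

```
# /etc/inittab: the default runlevel line
id:3:initdefault:    # 3 = multi-user; there is no runlevel 'm'
```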

How can I configure a local DNS server on my home PC, where I use Ubuntu 10.10? Can anyone explain the basics of naming a server (like 127.0.0.3), why we use this kind of naming system, and what the benefits are? Also, how do I configure an IP address, and how can I create a LAN connection, explained with the proper technical terms? One more question: is an understanding of the .NET language essential for any of the above? And which books did you learn all this from? I want to learn the whole procedure.

I make this one more often than I’d consider ok, so bonus hint is “DNS HOWTO”.

Re .NET, I’m happy to _not_ have made a mistake starting with something but LISP. One might enjoy SICP book, see http://mitpress.mit.edu/sicp/ (it also happens to be one of top google results for “sicp”, incidentally).

As already noted or referred to: writing to /dev/null writes nothing to nothing (essentially, that’s the end result). But even then, not only does > imply the file will exist afterwards (even if empty, which /dev/null effectively is anyway), >> would just append instead of overwrite. Either way the file exists. And hey, look at this:

# mknod -m666 /dev/null c 1 3
mknod: `/dev/null': File exists

Now, if you were to actually rm the device, that’s another story. In that case, you could just use mknod to recreate it and set the mode. Also, if you have SELinux enabled you’ll need to do more, or else you’ll get denials.

A few years ago, a friend of mine accidentally did another funny thing. He connected one port of an entry-level 24-port Netgear switch to another port on the same switch. This caused a switching loop, pushing packets round and round the same switch and rendering all network devices inoperative within minutes (!). This can also happen when a “smart” switch uses LACP with 2 or more cables to connect to another switch but lacks a function like “IEEE 802.1D Spanning Tree Protocol (STP): provides redundant links while preventing network loops”, or when you forgot to save the settings and the switch got restarted… so always remember to click the damn SAVE button before you leave work!

The same thing happened to me a few years ago, when a tech guy from a telco company connected both ends of the same cable to one switch within my client’s VDSL network, causing a so-called “broadcast storm”. His explanation was that he intended to make the VDSL modems work consistently well by forcing them to broadcast, to speed up a VDSL network based on analog telephone lines :). When we put a firewall device in front of the VDSL network, it was rendered unusable within a minute: it detected what looked like DoS attacks and brought down its ethernet ports, both LAN and WAN. It took us a day of investigation to find the segment (switch) that was causing the problems.

I know this post is about Unix/Linux mistakes (my sincere apologies), but it is useful to know what kinds of weird problems can stop a network from operating normally.

After an entire night spent at a client’s site migrating an Oracle DB from a W2K3 box to a RHEL 5.5 box (with JBoss on it), around 06:00 I just wanted to delete a symlink and ended up doing something like this: “rm -rf /etc/* /some_sym_link” (notice the blank space). I wasn’t sure what I wanted to do with this command because I was too tired. It was a production JBoss server with no backup, and it was supposed to run Oracle too. Luckily I still had the W2K3 box to continue working with, so no downtime. Within an hour everything was OK on the W2K3 box, but I still needed to reinstall the Linux box with Oracle and JBoss. This happened 3 months ago.

Guys, do not force a Unix or Linux box to do S.M.A.R.T. checks unless you are 100% sure what you are doing. A few years ago I forced a Unix (or perhaps Linux, can’t remember) box to do that, and it rendered the system disk unusable within two weeks. There is a conf file which can be used to configure that option, but the option should be enabled in the BIOS as well. I forgot to do that and ended up with a problem :).

I’m here again. Just learned something new. Be careful with the UUIDs of your volumes and LVM2 snapshots. It recently happened to me that both the original partition and its snapshot got mounted at the same location!! The /etc/fstab entry was: UUID=123456….. /var/lib/libvirt ext4 defaults …. but there was also an entry in /etc/rc.local: mount -t ext4 UUID=123456….. /var/lib/libvirt. That should just mount the same partition twice, but that’s not what happened: the snapshot was mounted first, I think. Luckily, the LAST partition mounted was the original /var/lib/libvirt, and then libvirtd started. For a long time I wondered whether I was going to lose data after deleting the snapshot… no file size increase, because the RAW images were already fully allocated… but I checked the modification dates and times of the files after remounting both to separate temporary folders. This happened on Ubuntu Server 10.x in a test environment. By the way, does anyone know why a snapshotted LVM volume is REALLY slow to read/write?

I was checking whether it is possible to recover files deleted with the “rm -rf *” command. I was in my home folder; I created a test folder and two test files inside it (but after creating the test folder I forgot to change directory, so I was still in my home folder). Then I issued “rm -rf *”. Silly mistake… :)

I’ve accidentally run commands on a remote shell instead of local machine more times than I care to admit.

I’ve learned the hard way to be VERY careful when using chmod -R and rm -rf. Thankfully it’s never happened on a production server, although I have annoyed a friend or two who asked for help setting up servers.

But the most outstanding one I had was initiated by a buggy KVM switch. We had 3 identical Debian Lenny servers in production (redundant, load-balancing DB servers) that we wanted to dist-upgrade to Squeeze… We did it the obvious way, choosing a low-load hour and upgrading the first one while keeping the other two in production… Well, in fact that *** KVM switch actually broadcast the keystrokes to all 3 servers. We noticed at the kernel update that all of them were rebooting; then we got phone calls about the service being unavailable.

Even more outstanding, the dist-upgrade went equally well on the other two, and as soon as they’d rebooted they were fully functional… a great testimony to the reliability and ease of upgrade you can expect from this distro. It’s the first time I’ve seen such a potentially big disaster turn into negligible impact, unattended.

(lol, after the adrenaline rush while you’re waiting for them to boot, you can enjoy the ineffable feeling of having pulled off a flawless, unintended upgrade of a full pool of production servers…)

has anyone else noticed that very few of the comments on this page use full English???? it’s scary…. and one made no logical sense until i imagined a few commas. also, i’ve done the remove everything …..

Capital ‘h’ for “Has”. One question mark is enough to denote a question. Capital ‘i’ for “It’s” since it is the beginning of a sentence. Three full stops ‘…’ denote an ellipsis not four. Capital ‘i’ for “I imagined…”. Capital ‘a’ for “also” as it’s the start of a sentence albeit a poor one. Capital ‘i’ for “I’ve” and another ellipsis with too many full stops/periods.

It’s crucial to stay aware of context! My coworker wanted to delete the contents of the current working directory, so he typed “rm *”, but he didn’t have the necessary permissions to remove those files as himself, so he moved on to:

su -
[root password]
rm *

There was just one little problem with doing that. The “su -” command also moved him to root’s home directory, which was the root of the filesystem, and shortly thereafter he realized the system was no longer bootable.

Hm, why that, when you have these options to rm (assuming you do; else ignore)?

-i    prompt before every removal
-I    prompt once before removing more than three files, or when removing recursively; less intrusive than -i, while still giving protection against most mistakes

Better still is to get into the habit of explicitly typing the option if you’re going to rely on -i for rm (or other commands). That way, if you ever hit a version of rm with no -i option, you’ll get an error (or should). And keep in mind that depending on where you put an alias definition, you might not always have it available (though in that case, unless a command existed by the name ‘rim’, you’d indeed get an error). My initial point was really this: always explicitly specify options of this type rather than relying on aliases, because aliases can bite you in ways you might not foresee. That includes someone compromising your machine, or simply your account (e.g. to then go after root, gather information, whatever), and tampering with your alias while changing nothing else; or, if you leave yourself logged in at work, someone doing their deed that way, as has happened to many. Also, when specifying -i, keep in mind that rm -fi will prompt you and rm -i will prompt you, but rm -if will not (and yes, I am skipping the file argument, as I’m referring to the command and options only).

I was watching an admin deal with an interesting problem. We had a file named ” ” (space) in the root directory which he wanted to delete…

…you’re way ahead of me.

Sure enough, he typed:

rm(space)-rf(space)/(space)(return)

This system was the last running one of its kind in the company; we had no alternate software for it and no way to recover it or boot it up again. This was in 1988, at a shop that specialised in doing its own ports of UNIX to just about any piece of hardware you gave it (“give us 7 days and we’ll have a single-user prompt on a crowbar”). This one, though, was so ancient there was just no helping it…
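For the record, a file named " " can be removed without going anywhere near the root directory, for instance:

```shell
# Removing a file whose name is a single space, safely:
touch ./' '                 # create one to practice on
rm ./' '                    # anchor the name to the current directory
# or: rm -i -- ' '          # quote it; -i prompts before each removal
# or: find . -maxdepth 1 -name ' ' -delete
```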

One thing I stuffed up at work was dumping a database to text format, then reinserting the records. I think it was PostgreSQL, a few years back. You would expect it to be able to read its own SQL, but we were using a human-readable date format and the timezone was printed as ‘EST’, for Eastern Standard Time in Australia. On re-importing, the DBMS apparently interpreted the dates as EST in America. So I learned to always use ISO-style date and time format, e.g. 2011-12-31 23:59:59. It sorts nicely too.
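The habit is cheap to adopt; for instance, from the shell (these are standard strftime codes, so any reasonably modern date should accept them):

```shell
# ISO-style timestamps are unambiguous and sort lexically:
date +"%Y-%m-%d %H:%M:%S"        # e.g. 2011-12-31 23:59:59
# Adding a numeric UTC offset with +%z removes timezone-name ambiguity entirely.
```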

I’d like to propose a new direction for this thread… What, if anything, have we learned from all these many and varied UNIX disasters? Here’s my list so far:

1. Keep multiple remote versioned backups of anything you care about.
2. Keep multiple failover systems ready to replace any critical system in a moment.
3. If a process is complex or time-consuming or error-prone, try to automate it.
4. Hire competent netops and coders, not the numbskulls who posted here ;)
5. Do not entrust any shell coding ‘toddler’ with root, DBA powers… or anything else.

Please tell me what you’ve learned too… or any other good advice for netops…

Do a search on the page for ‘learn’ or ‘lesson’ and you’ll quickly compile a list.

One of the simpler things is to make a backup of any file you edit before you edit, delete it when the edits test as good.
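Something like this, where demo.conf stands in for whatever file you're about to touch:

```shell
# Keep a timestamped copy next to the file before editing it:
touch demo.conf                                   # stand-in for a real config
cp -p demo.conf "demo.conf.bak.$(date +%Y%m%d-%H%M%S)"
ls demo.conf.bak.*                                # the safety copy exists
```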

Test with the ‘ls’ command before running the ‘rm’ command: the extra seconds are priceless.
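Concretely: let ls expand the same glob first, then reuse it.

```shell
# Preview exactly what the pattern matches before deleting anything:
touch a.log b.log keep.txt      # sample files for the demonstration
ls -d -- *.log                  # shows a.log and b.log, nothing else
rm -- *.log                     # same pattern, now with confidence
ls keep.txt                     # the file outside the glob survives
```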

If you can’t replace it or rebuild it – think long and hard about why you can’t, why there is no backup, and just what in the fuck do you think you are doing by editing/changing anything on it. That goes double if it is a production box.

Seriously, why don’t you have a test system?

Why are you running a command which modifies or deletes anything if you don’t have a backout plan? Sorry, say that again? Are you sure you want to do that? NO, go to lunch and think about it. If it’s still a good idea, well then ok. Go ahead, run that command with no backout plan. It’s on your head.

As for your points, 1=yes, 2=yes, 3=YES, 4=ha ha, 5=hey, wait a minute. Even script gods have bad days, so don’t trust anyone to get it right. Severely restrict root access, DBA powers etc. to only those who must have them. There are some things which should not be at root access levels – if that is inconvenient, reassess permissions for it. If it should be root-access-only reassess why anyone needs access to it. Even if you are a script god/admin don’t do your daily work as root. Use sudo. Restrict root login to ONLY the console, no remote root access. Stuff the reboot command inside an alias script which prompts you with all the right information that you should know before rebooting before it runs the reboot command, such as server name, uptime, number of connected users etc. You might even restrict that command (and others) to a special user that you must sudo to first. On production systems, 15 seconds to access it before you run the reboot command can save your job for you.
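A sketch of such a wrapper (everything here is illustrative; swap the final echo for the real /sbin/reboot only once you trust it):

```shell
# Show the facts, demand a literal "yes", and only then reboot:
saferboot() {
    echo "Host:   $(hostname)"
    echo "Uptime: $(uptime)"
    echo "Users:  $(who | wc -l) logged in"
    printf 'Really reboot THIS box? (yes/no) '
    read -r answer
    [ "$answer" = yes ] && echo "would run: /sbin/reboot"
}
```

Fifteen seconds of reading that banner is exactly the pause being argued for above.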

Now, write a script to implement these changes you have made on any and all production servers. Implement them on all production servers. Prepare your USB controlled NERF missile launcher to destroy anyone who complains.

I LIKE my moniker. Apparently 4000+ others like it enough to use it also. If I managed to do something kind 30 years ago, I don’t remember it so I’ll not try taking credit for it. Of course, I can’t remember much of last week, so 40 years ago is going to be a stretch no matter how much red bull I drink. I guess I’m “an” Mr Z, probably not “THE” Mr Z, but he has a good name ;-)

Anyway, thank you! Having been an admin for years, I still find useful things on this site, so thank you for that at least. I understand what you’re saying. I can’t remember what I ate three days ago. The Mr. Z I was looking for is related to Triad; Google knows, Google knows. I learned from one of his friends that he’s “working some Linux boxes with his father”, which is why my neural network rang a bell. Best!

BTW, my comments to you are how I treat system administration. I always assume my assumptions are wrong… ba dump ba. Seriously I always check twice especially when I think I know what I’m doing. This applies to many things in life. If you are building some cabinets or such at home, measure 4 times, cut once. You only get to cut it too short one time. You can cut it too long several times. There is no substitute for a second opinion. Google et al are a wonderful source of second opinions, and like this site, those opinions are based on how to fuck it up. So read these and others and do not make those mistakes. Of course you will, we all will, but if this reduces the volume and regularity of your mistakes then it was well worth putting this blog online.

My own personal mistakes:

Sun hardware often has interlock switches on the covers. Don’t defeat them to look inside a production box that is online.

TEST your backups. No, I really mean TEST your backups the hard way. This is what test servers are for. If you don’t know your backup works, it WILL fail. While you are at it, try a little disk-2-disk-2-tape backup. Multiple copies never hurt, and having them available on network-attached disk is a speedy recovery method. Did I mention that you should always have backups — plural? Disaster recovery is a lifestyle, not an SLA requirement. Repeat that to yourself every morning. Whether you care about your customer’s data or not, you SHOULD care about your nights and weekends. You might spend many of them in a blind panic if you lose your backups, or don’t have any to start with.
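A restore drill can be as small as this (the names are made up; the point is the extract-and-compare step):

```shell
# Never trust a backup you haven't extracted and compared:
mkdir -p functions && echo data > functions/f1     # stand-in source tree
tar -czf project.tar.gz functions                  # take the backup
mkdir -p restore-test
tar -xzf project.tar.gz -C restore-test            # prove it extracts
diff -r functions restore-test/functions           # prove contents match
```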

Oh, on the topic of backups: install a version control system. Use it for all your scripts. Trust me, you don’t want to rewrite them from scratch — ever. Build all your boxen so that they can be replaced in a heartbeat, or in as few heartbeats as you can manage. Seriously, remember the sanctity of free evenings and weekends at the bar with your friends. These will be violated regularly if you are not ready for disaster(s).
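Getting scripts under version control is a five-minute job (git assumed here; all names are illustrative):

```shell
# One-time setup for a scripts repository:
mkdir -p scripts && cd scripts
git init -q
printf '#!/bin/sh\necho cleanup\n' > cleanup.sh    # stand-in admin script
git add cleanup.sh
git -c user.name=ops -c user.email=ops@example.com commit -qm 'import admin scripts'
git log --oneline        # every future edit is now recoverable
```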

Another? High availability systems are worth less than the single router they are connected to. Tut tut tut. Yep, I’ve done that one, or rather suffered it. The OSS group or network group are NOT your friends. Make sure that disaster recovery is THEIR lifestyle too. Trust no one!

As I’m reading these mistakes, I remember making a funny one a few years ago. At university, we had our own server in the school’s computer center with some “home-made” utilities, one of which was called “kick”. It served to kill (-9) all the processes of a specified user. This was achieved simply by setting the root suid bit on the binary, calling the setuid() syscall to switch to the specified UID, and executing killall -9 -1 to kill everything the kernel allowed. It had been a long time since we created these utilities, and we had partly forgotten their inner workings. Meanwhile, we had freshly introduced our “home-made” kernel security patch, configured to make the setuid() and seteuid() syscalls ineffective (the syscalls returned no error, but the process’s (E)UID remained unchanged) so as to disable any kind of root access via an exploited setuid program; to become root, one had to log in properly via ssh from one of the defined IP addresses. In addition, to make administering the machine more convenient for the roots, certain UIDs had certain capabilities added, e.g. to delete some non-owned files from defined directories (not the system ones) or to kill non-owned processes. I was, of course, one of those users. That Friday evening I was somewhere else, working remotely, and when finishing up I was too lazy to exit all my shells and other processes properly, so I wanted to show my friends a way to log out both impressively and easily. I issued kick beus and watched with surprise as all my friends’ screens (logged in on the same machine) suddenly went blank. Then I realized the truth of the matter: the kick command did a setuid() syscall to the specified user’s UID, which failed; and, as I still had my capability of killing non-owned (including root’s) processes, the executed killall did its work properly, and we ended up with the machine dead for the whole weekend and Monday morning.
Just the kernel and init survived so we could only watch it responding to ping ;-) Since then, we moved sshd into inittab and later introduced a remote-sysrq target into iptables :-)

When I was young and stupid, I accidentally ran the poweroff command on a remote Linux box. I wanted to shut down my local machine but forgot I was logged into a remote server. This was in the evening, just before I left for the pub. In the morning I had to investigate why the server wasn’t running, and I noticed the shutdown sequence in the log… Since then I do NOT use the poweroff command. I use “reboot” in every case and then shut the machine down during the reboot by pressing the power button. If I do have to use poweroff (e.g. I really want to shut down a remote machine), I double-check that I’m logged into the right box.

My first great, unforgettable mistake: in my first days on Linux, I ran gparted in a have-absolutely-no-idea-what-I’m-doing mood. I just answered “Yes” without reading the warning text. That was the last time I saw a running Ubuntu: on reboot, the computer said “no partition table found”.

Second great mistake: in my first hours in zsh, totally amazed by its great tab completion system! Its damned tab completion made me type “rm /bin/mount”.

I nearly forgot my third great mistake! I was having so much fun on the university Gentoo server, and those were my first, very fun days with Apache. I was editing httpd.conf and didn’t realize I had pressed the return key, un-commenting part of a help line… The server restarted, and within minutes ALL WEB APPLICATIONS USED BY A THOUSAND STUDENTS returned a very lolling SOCKET ERROR! ahahah

I’m on the phone with a lab tech at a large cable TV company. He’s got a runaway process on their video-on-demand server chewing up a CPU core.

me: It’s a non-critical process, so you just kill it and restart it.
tech: How do I do that?
me: Type “ps -ax | grep processname”. Find the PID. Then do a “kill -9” followed by that pid.
tech: “kay eye ell ell dash nine” ok. Now what.
me: It should restar…
tech: Whoa. Uh. Hang on. Something else is going on here. I, uh, I gotta go.
me: Ok. Call me later when you have time.

An hour later…

tech: Don’t know what that was, but it’s fixed now. Ok, so I killed that process.
me: Yeah? So check if it started back up now.
tech: Ok. I remember the pid was “1”…
me: hahaha. Wait. You are kidding, right?
tech:
me: O_o
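Dictating PIDs over the phone is exactly where this goes wrong; pgrep/pkill (from procps, widely available) remove the transcription step entirely:

```shell
# Find and kill by name instead of copying PIDs by ear:
sleep 300 &                       # stand-in for the runaway process
pgrep -f 'sleep 300'              # prints its PID; pgrep excludes itself
pkill -f 'sleep 300'              # kill by the same pattern, no typing "1"
```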

One time, when compiling some of my code, I accidentally did gcc -o file1.c file2.c instead of gcc -o file.o file1.c file2.c, which overwrote file1.c with the linked output :( Obviously, I was not happy after that.

I have also done the accidental “type command into wrong window” thing before, except the other way around. I meant to shut down a remote computer (another workstation, not a production server), but I accidentally ran the shutdown in a window for the local computer instead. So I was the only one bothered, but I had a LOT of unsaved work open. End users spared, but I was furious.

Somewhat similar, but not necessarily my own mistyped commands: I really like making programs for automation, so I was all too happy to make something when some of the staff here said “We’re tired of shutting down computers in this area manually every night after people leave.” I made them something that would shut down all the computers in their area, only to be used at the end of the day. It was called “SHUTDOWN LEARNING COMMONS” To this day, they still get a new person from time to time which runs it, wondering what it is for. They have shut their area down a few times when they had a lot of people doing work over there.

As an added bonus, I made the program initially so that it would run in the context of my personal admin account so that it had privilege to shut down the remote computers. The remote computers were windows xp, and the xp window that pops up warning users of the remote shutdown displays front and center what network account initiated the remote shutdown. I was told one day that someone had accidentally run it again, and that one of the more “important” users was searching for me to “give me a piece of his mind” because he saw my name attached to it, even though I was not even at work when it happened. Since then, I made an account called “shutdownLC” whose only purpose is for that one program.

Another time, network switch related rather than unix: I accidentally locked myself out of our core switch when I was changing vlan settings on it. This rendered our entire network basically useless. I know this is one of the more common mistakes, but here’s the worst part: It was an older switch that doesn’t have an RJ45 console maintenance port, so all of our serial cables with RJ45 connectors on the other end were useless for this switch. We had a few old ones in the building, but it had been so long since they had been used that nobody knew where they were and we couldn’t find them.

The network was down for a few hours before my boss made a special trip in and made me feel like an idiot after he got there: he unplugged the cables that went to the core switch and plugged them into a different switch then said “This might slow things down a bit, but at least it will work now.” I don’t know why I didn’t think of that…

Had to fix a box that was borked by an admin who (possibly) accidentally deleted /bin/grep; the system wouldn’t boot. It was full of errors everywhere and kept hanging. While troubleshooting in a chroot environment booted from a CD, I tried to grep config files, only to find to my surprise that grep wasn’t there (2-3 hours down the road, of course).

Some filesystems have an undelete utility. Just pull the power cable, not allowing the disks to sync or write anything, take a deep breath, grab some coffee and try to get the data back. I learned this after destroying some data on different filesystems: ext2/3/4, reiserfs, even ntfs. The more data is written and the more fragmented the filesystem is, the less chance there is for recovery. The case is more difficult when using RAID; that usually needs backup to another array or tape.

Adding a sudoer to a machine on which I had sudo rights, but not root:

# User privilege specification
cschultz ALL=(ALL:ALL) ALL   # this was already here
mjagger  ALL=(ALL:ALL) ALL   # this was already here
sadams   ALL=(ALL:ALL) ALL   # this was already here
metoo    ALL=(ALL;ALL) ALL   # the semicolon broke sudo, so none of the admins can log in.

Changed the network (eth0) configuration of around 100 servers during a data center move and forgot to change the network ID. This was done through a script from a trusted host. Also issued ssh xyz `poweroff -f` on the admin host, which killed the admin host itself.

Copied up to 50GB of data to /tmp as a temporary location (had planned to copy it elsewhere afterwards) and forgot. One reboot and the whole lot was gone.

Ran crle -l /some/path/to/libs on a Solaris production box and it messed up the whole system. None of the other admins could fix it, and we had to go for a reinstallation.

Excellent topic. Thank you for sharing with us. And thank you to all who left some more comments.

I think chown and chmod are by far worse than rm -rf. Because usually you have backups, and you can easily restore a folder unless you deleted your entire file system. Fast and clean. But chown and chmod? You’re in for a full restore… unless you want to lose some time believing you might find a way to fix the mess…

Another one I did was a quite simple

apt-get autoremove

Well… it removed automatically, all right… Removed dependencies for an uninstalled application… but those dependencies were still required by quite a lot of other packages… I still don’t know if it was supposed to do that… but I don’t use autoremove anymore…

I’m still new to Linux… working on it sporadically for 3 or 4 years now… Lots of mistakes to come, for sure. I’ll keep this post in my bookmarks to come back and read again sometimes.

Not really a significant mistake in terms of downtime/destruction/damage/etc, but still. So I work for a hosting company, and some of our boxes have upwards of a thousand or so domains on them, including the Apache virtual hosts as well as named zones.

Well, one day I edited a zonefile and restarted named rather than reloading it. named takes *forever* to start up from scratch with that many zones, so all the sites were unresolvable for around 7 or so minutes. (Note we don’t have a redundant/slave DNS setup, but we definitely should.)

A bad mix of screen and rsync. I was working on adding a slave database server to a staging environment: opened screen with a terminal to the existing database and a terminal to the new host which was to be the slave. Got my rsync command ready and fired it off… only I was on the wrong DB and using --delete, so I wiped out the staging database in a matter of seconds. I never had so many perplexed developers calling me in my life. =) Luckily, staging is a dump from prod with some sanitizations run to fudge the data, so I simply needed to run that process manually. In the long run the devs weren’t that upset, because I’d given them a free day off. woohoo

Our Linux servers use csh by default, because our production software relies on csh environment variables. My workstation is bash by default (and preferred), so, upon making changes to the .login script, I used this:

if [ $this = "$that" ]
then
# more bash syntax that killed the login

Luckily, not a huge mistake, but it did kill everyone’s ability to login. So, had to mount the home directory as a sftpfs, and fix .login.
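The two syntaxes really are incompatible; here is the same test in both (the bash form is runnable as shown, the csh form is in the comment):

```shell
# bash/POSIX sh conditional; csh would choke on '[' and the 'then' placement:
this=foo; that=foo
if [ "$this" = "$that" ]; then
    echo match
fi
# csh equivalent:  if ("$this" == "$that") echo match
```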

No problem for a couple of weeks, until I had to reboot my box. The next morning, looking at my completely empty home (except for some RPMs…), I realized that I had forgotten to add an entry for /home/lala/dump, and that a “&&” instead of “;” in the cron job would have been nice to have.

Started to fill up the hard drive on my development machine… So the company buys me a nice shiny new drive. I install it late the night before and get up early to get some extra work done, because I know it is going to take a while to set it all up.

mkfs the first partition on the second drive…..or… is it the second partition on the first drive? …OH crap!

Good news: there is a full backup on tape. Bad news: it’s offsite, half an hour out, half an hour back, and 3 hours to restore it, *after* I rebuild a bootable system to restore it to…
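One cheap safeguard against mkfs-ing the wrong device is to refuse when the target is currently mounted. A minimal sketch (the device name is hypothetical, and the mkfs is only echoed, never run):

```shell
# Guard: refuse to format a device that appears in /proc/mounts.
is_mounted() {
    grep -qs "^$1 " /proc/mounts
}

dev=/dev/sdb1   # hypothetical target
if is_mounted "$dev"; then
    echo "refusing: $dev is mounted" >&2
else
    echo "would run: mkfs -t ext4 $dev"
fi
```

Pairing this with a quick `lsblk` or `blkid` look at sizes and labels makes "second partition on the first drive" mix-ups much harder.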

Digging up old thread, but I have a story that I thought would be more common:

I needed to create an account on the production database server. The admin trusted me, so he logged in as root and let me create the account while he fetched coffee. It was running an old version of MySQL and was being fussy with the GRANT command, so I just added the row directly to the mysql.user table. Problem was I got the password wrong, so I thought, no problem, I’ll just update it: UPDATE user SET `password` = Password('mypass');

I got distracted by remembering to hash the password, and by reflex hit the semicolon [enter] before catching myself.

You know that moment when you go cold and realise you have just majorly messed up a production server that isn’t even your own and you really ought not to have been on in the first place? Fortunately I had done a select a few lines before and it was in my terminal scrollback. I had to manually enter in the 30 odd user password hashes before the admin got back with his coffee. I never told him.

I have now got in the habit of typing in “\c” and then only after I have finished the whole command and reread it do I remove the \c and add a semicolon.

My personal favourite is accidentally copying the default tcsh/csh prompt from the command line and pasting it. (Yeah, I know you can just browse the history, but I think I was working with multiple terminals and trying the command line out in a different shell environment.)

I’ve done almost all the mistakes mentioned here… for me, the worst was early in my career, when I was working in a pre-production environment on a customer site. Our “pilot” machine shows off our custom software to customer mgt and execs, lets them demo the software and provide feedback, that kind of thing… it’s hugely visible and must have constant availability. On this project it was the only machine of its kind at the time, due to budget constraints.

So, we spend almost a year developing the software for this particular project and had gone through a couple of rounds of feedback, software updates, etc for the customer to see, play with, approve, etc.

One day I’m on the box in the middle of the day, prepping for an update later that night, and instead of removing the contents of an old version’s backup directory, I run this as root

rm -rf *

and go back to my work, waiting for the command to finish. Only when the command was still running 10 minutes later did I realize what was going on. I stopped the rm command, but the damage was done. The box was hosed when it was supposed to have 24/7 availability.

I happened to be working with a great guy at the time who was superior with Unix. He did some command line magic using cpio, some pipes, and some other stuff that to this day I still don’t understand, and in real time over the network rebuilt the machine, software and all, in a few hours. No one knew what happened… we took some heat for the downtime, but blamed it on something else I don’t remember.

I always always always have backups now… local backups, as well as remote backups.. and whenever I design systems, at least N+1 redundancy.

I was in charge of Usenet for a pretty big ISP. We had two identical Usenet servers, huge systems for the time, dual Xeons with hundreds of gigabytes of storage and as much RAM as the rest of the machine room combined. We’d test upgrades on one, get it all working, then change the IP and swap the cables so that the test box became the production box. Well, one day after a successful upgrade I issued the command to remove and rebuild the news spool on the test box. Of course, I was *actually* logged into the identical-in-appearance production system. Tech support started getting calls 5 seconds after I hit enter. The moral: at least change the prompts or something to be different on test/production…

Another time I worked for a bunch of scientists and wasn’t “really” the admin. My boss was the sysadmin and I was the lab assistant, although he didn’t know much about running a computer, so I did everything. But he insisted on asserting himself occasionally just for appearances, and one day informed me that he’d added a cron job to “clean up” during the night. Next day we came in to a server filesystem that was completely empty except for /bin/crond and /bin/rm… The guy had written a shell script that ended with (cd $HOME;rm -rf *). Of course, in those days cron ran as root… And root’s $HOME used to be /, not /root :-) Whoops.

I’m not a sysadmin, but an app admin. When we did a major software platform upgrade on the same hardware, I had to get the new code deployed. The night before the release I had to kick off the deploy script, but it was late and I needed to get home. Screen wasn’t installed and nohup didn’t always work, so I ended up putting it in cron. Then I got sick and was out for several days, right through the major release. Everyone kept wondering why one of the clusters kept going down at the same time each day and reverting our code.

It seems to be possible to comment out a working command in an HP-UX crontab, create a syntax error, and thereby make the crontab disappear on exit. I always make a manual copy now, as well as auto-copying it off the box periodically along with /usr/local/bin and such.

I forget the exact sequence and am not about to experiment, but I was trying to upgrade the libc binary on a live machine so that gcc3 could compile. I think I just moved libc.so.6 to libc.so.6.old… and ln, mv, cp, etc. rely on libc.so, so they wouldn’t work anymore. I did manage to google a solution to get the machine going again.

crontab -r just has no reason to exist; open the cron file and delete everything if that’s what you want to do. I had a bad experience, and afterwards could never remember which one was the bad one: -e (which I misread as erase) or -r (remove). (-e is edit, -r is remove.)

I am a beginner in Linux administration, and I accidentally deleted anaconda-ks.cfg and the install logs under the /root/ directory:

1. I unzipped files with the overwrite option over SSH. 2. I pressed the refresh button in WinSCP to make sure the files were overwritten. 3. I did not realise that WinSCP had changed the directory after the refresh. :( 4. I saw that anaconda-ks.cfg and two more install logs were not what I expected, and impulsively deleted those files. 5. A minute later I realised this was a mistake: now there is no install history or template for future installations, and possibly some company scripts rely on that file.

Conclusion:

Make backups of everything important; make sure that you are working in the right directory; focus your attention on one console and avoid GUIs for installation tasks.

Virtual machines have helped with that. With vSphere/ESXi, when I mess up my ssh configs, I can just find the Windows machine (that being the only OS that runs the vSphere client) and go in the back way to the server, fix my error, and go happily on my way. One time I accidentally used a semicolon instead of a colon in the sudoers file when adding a new admin to a client’s machine. The machine happened to be 1000 miles away, in a data center I do not otherwise have access to. Since it was not one of my own VMs, I had to ask a colleague in New England to go in and fix my error. It does make one feel human. Cheers, Wolf

[Please feel free to delete this, in parts, or even totally, btw, if it’s too verbose.]

Fortunately, the only semi-serious goof I ever made to a system not my own was to drop the backup Flexowriter console typewriter at the early (above-ground) NORAD/BMEWS COC. I was fired. Should have asked for help lifting it off the floor. Hit cork tile over concrete. I also wiped some customer data (not important) from a mag. drum-based data-storage device (SEMA, quite obscure) that I was trying to interface to an IBM electromechanical tab-card machine with lots of internal sparking contacts. They weren’t really concerned.

ThinkGeek, iirc, is where you get T-shirts that say “I void warranties”. While that applies to me also, mine should say, “I disable operating systems (only my own)”.

At least three times, I have destroyed my Linux installations (and one Win XP; sorry). Being over-tired and doing something risky is just stupid, when one is free to hit the pad.

Linspire was a lovely Linux distro with a very nice user community. It faded when its developer died, and his son really didn’t want to continue. I hosed mine via emelFM, a twin-pane file manager that was new to me. I had one pane up showing /, the other /home. I wanted to recursively ensure that all of /home was owned by {user}, not root. I chose the wrong pane, because the UI didn’t really make clear which pane was active. I forget the details, but the permissions for the system were hopelessly hosed. Backups? Good idea, but not yet…

I think it was XP; I had about six partitions and some unallocated space. Made a partition there to put Linux into. Did a mkfs.ext3 on what I thought was the free partition. Unfortunately, partition “index” numbers (such as /dev/sda2) don’t necessarily follow the physical sequence on the platters. That can be a horrible “gotcha!” I got my ext3 filesystem OK, but XP was puzzled by the leftover hybrid partition type… Backups? Not yet… (Most of the data is probably still there between the superblocks and inodes, recoverable.)

While I like the Unix philosophy of avoiding user feedback unless wanted (I’m 75, and have some feel for the command line on, say, a DECwriter), I do think it shouldn’t be that difficult for the (GNU) mkfs commands to warn the user that a partition about to be formatted contains data. Afaik, # rm -rf actually does ask whether you’re sure, at least in some distros.

I tend to delete Bash history any time I’ve used an rm command that could cause future grief. All too easy to poke the up-arrow and hit Enter.

Discovered Parted Magic, iirc; small bootable CD, actually a workable small distro in its own right. Wanted to see what the GUI for deleting the partition table looked like. OK, so /that’s/ what it looks like. Over-tired; clicked on [execute] instead of [dismiss] (actual names differ). Should clone that HD and try C. Grenier’s [testdisk].

OK, although have long believed in backups, and too rarely done them, decided to back up openSUSE 11.1. Real fool, doing it in a very makeshift fashion (I know better). Krusader, twin-pane file mgr. Active. F5 is Copy, F6 is Move, iirc. Meant to hit F5, hit F6, instead. Watched progress, and suddenly files in / started to disappear.

Last, I had several 100 GB of mostly non-valuable videos and stills on a 1-TB HD. Hosed it (tried to expand a partition, but the machine timed out and shut down during resizing). Used the wrong recovery app written by Chr. Grenier, with no luck, of course: I wanted [photorec], but was trying to do it with [testdisk]. No luck; segfaulted testdisk once. Finally tried # dd if=/dev/zero of=/dev/sdb bs=1M (details from memory, maybe not totally correct). It took over 3 hours, and, this time, I had disabled machine shutdown after 2 hours; set it to “never”. Only then did I realize that I should have used [photorec].

It’s an alias. rm -i does that. Do be careful you don’t rely on that though, because when the alias isn’t there.. well you can be surprised.

As for mkfs: define “data”. The best habit for the future is to check disk usage on the mount point first (and of course make sure it’s not mounted before you mkfs; it may give notice of that, but I haven’t tried it manually in a long while), e.g.:

df -h /

would show how much is in use and free on the root volume. (That’s the other thing: if a partition is not mounted, nothing will be seen data-wise, not at the higher level anyway.)

And for backup, depending on the size of what needs backing up, a few thoughts: a cron job that backs up nightly/weekly, whatever (or even something like the program Bacula). Also, you can save a dump of the partition table, boot sector and so on (not useful for everything, but it can be of use at times).

One other thing about backups : when recovering data, never write to the disk its on, including restoring or installing the program to recover with to it.

And on another note, regarding permissions. You could always do:

chown -vR user.group /home/ or chown -vR user:group /home/

Just make sure you never do chown -Rv user.group .* or similar (in general be very careful with recursively changing ownership).

Working for a hosting company has given me some interesting experiences.

Like the time a customer support agent gave a customer the instruction (via email) to simply execute [rpm -e XXX] on his CentOS box. Apparently the customer didn’t understand, so the customer support agent typed in [rpm -e ] and then pasted [rpm -e XXX], so the shell ran rpm -e against the package named rpm, erasing RPM itself. I spent well over 6 hours getting RPM functional again, and I can assure you it is not a fun experience.

Or the entertaining situation created when a customer support agent wanted to “grow” an existing partition for a customer on a second drive by creating a new partition with the same name and then mounting it over the existing partition. Nothing was lost, but you should have heard the customer scream about his “data loss”. Luckily, mounting a partition in this fashion doesn’t erase data but simply hides it, yet it still took several of us a couple of hours to figure out what had actually transpired…

SQL: DELETE FROM some_table; -- missing WHERE clause. Bun fight for the next 3 hours for the trained monkeys at the client to get the backup tapes for a restore.

There is a Linux installation where almost EVERYONE has the same home directory (no need to store user-specific stuff on this machine; they all share the same .profile). From force of habit: userdel -r minion; ## because minion had left the company. The -r wiped out the shared home dir for all users, including all the DBF files for the production Oracle database. About a day’s downtime.

The same guy made a second mistake: logged in remotely to a Unix server, he ran: ifdown eth0; ## intending to follow with ifup eth0; to pick up a config change. But he couldn’t get his prompt back after the first command. I wonder why ;-) He rang up and told someone to force the power off, then on.

An entire group of suburbs without phone service for a day: some vague rumour about something misconfigured.

One that I did myself, in the very early days of Unix: confusing rm with mv. It was only a single file deleted (the intended destination didn’t exist, and rm croaked with an error, which alerted me to the mistake). But it was still 2 hours recreating the C-program assignment from memory, and it was someone else’s, whom I was helping.

Another one I did myself: swapping between the CVS storage area, and the equivalent working directory in my home directory: wanted to delete part of the working directory tree and start over: rm -fr stem_of_tree; ## but in the CVS area, not mine. Got someone to restore from backups. It was a pretty rudimentary CVS setup.

rm -i ## at least when you have your L-plates on (L-plates is Aus slang for “learner driver”).

Only execute something as root if it won’t work otherwise; I generally have two windows open: my normal user and the root user, and I do things like ‘ls, cat, less’ in the normal window. Then I do things like vi, or rm as root (but I do a ls just before the rm, as root).

I configure my root windows with a different background colour.

pwd; prior to rm or recursive commands.

prior to a recursive command on a small directory tree, find . -print; or, find . -type d -print; see how many files will be affected (or for slightly larger situations, the second one gives the range of directories to be affected).
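The preview idea above, run against a scratch tree (the paths are a demo fixture):

```shell
# Before a recursive rm/chmod/chown, preview the blast radius.
d=$(mktemp -d)
mkdir -p "$d/a/b"
touch "$d/a/f1" "$d/a/b/f2"
cd "$d"

find . -print            # every file and directory that would be hit
find . -type d -print    # just the directories, for a quick sanity check
find . -type f | wc -l   # how many files would be affected
```

If the count or the directory list surprises you, stop before the destructive command, not after.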

When I write a script to rename files or change them in a robotic manner, I put an echo in front of the intended command, so it will give me the commands that are going to happen. If satisfied, I remove the echo, to let the command actually happen.
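That echo-first habit, sketched as a batch rename (the file names are a demo fixture):

```shell
# Dry run first: print the commands instead of running them.
d=$(mktemp -d); cd "$d"
touch report1.txt report2.txt

for f in report*.txt; do
    echo mv "$f" "${f%.txt}.bak"    # prints "mv report1.txt report1.bak" etc.
done

# Satisfied with the printed commands? Drop the echo and run for real:
for f in report*.txt; do
    mv "$f" "${f%.txt}.bak"
done
ls
```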

If I’m editing files in a robotic manner (by script), I create the new files with a backup extension (eg file.cxx becomes file.cxx.new). Then I diff the various original files against their new ones, to ensure it is as intended, then I rename the files, mv -i ; and keep my eyes peeled as I say yes to each one.

The > is your enemy. If I want to create a brand new file by redirection, I … cat brand_new_file; (expect a not found error), then some-cmd > brand_new_file;
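POSIX shells also offer a belt-and-braces option for the ">" footgun: noclobber, which makes redirection refuse to truncate an existing file. A small sketch:

```shell
d=$(mktemp -d)
echo "precious" > "$d/file"

set -o noclobber
# With noclobber set, ">" onto an existing file fails instead of truncating.
( echo "oops" > "$d/file" ) 2>/dev/null || echo "refused to clobber"
cat "$d/file"                    # still says: precious

echo "on purpose" >| "$d/file"   # >| deliberately overrides noclobber
```

Appending with ">>" is unaffected, so the safety net costs almost nothing day to day.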

If I want to append by redirection, I make a backup of the target file regardless.

When I edit a config file in /etc (or anywhere), I … cp -i file.conf file.conf.20120116; named after today’s date. The dates aren’t always truthful, but it’s miles better than nothing.
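The dated-copy habit, sketched on a scratch file standing in for a real config:

```shell
# Take a dated backup before editing; -i refuses to overwrite an
# existing backup from earlier the same day.
f=$(mktemp)                       # stand-in for /etc/whatever.conf
echo "setting=1" > "$f"

cp -i "$f" "$f.$(date +%Y%m%d)"
ls "$f".*
```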

Create dedicated users for running VMs and similar (especially if they are shared). That’s the approach Apache uses (usually automated).

- Adding echo before the actual effective command when making any automated file-manipulation script, and only removing it after several satisfactory test runs, is exactly my approach too :)
- If creating a brand-new file from redirection, I use the advantage of tab-key completion ;)
- When editing config files, I usually just copy and comment out the original line(s) and then edit; it saves the time spent diff-ing the versions. This way I sometimes get configs a few times larger than they have to be, but an occasional cleanup of unnecessary stuff during the next edit takes care of that :-)
- As to root access: I never have any root-shell window at all and do everything under my user. If anything really needs root privileges, on machines not needing top security I avoid having to type the root/sudo password each time by just running “do command parameters”, where “do” is a tiny hyper-simple program in my ~/bin (which I have in $PATH) that runs the rest of its command line under root privileges (it is chown root:special_group and chmod 4510, i.e. r-s--x---), basically going like this:

setuid(geteuid());
execvp(argv[1],argv+1);

Mine is extended to run a shell if invoked without parameters, but I almost never leave such a shell open unused. I even used this on some servers; being executable only by a certain group, together with hiding it in odd locations, provided a high enough level of security (even if someone gained access to one of the root-allowed users’ accounts, they would have to know what to run).

Anyway, what contributes most to command-line safety is the distinction in the prompt. I always set it up to show hostname[tty]:dir> for me and root@hostname[tty]:dir> for root, sometimes in a different colour. This way I’m always warned that it’s a root shell before issuing a command. The benefit of always seeing the hostname and directory before even starting to type a command or manipulate files needs no explanation (if the hostname is not distinguishing enough, just use any string that identifies the machine), and knowing the tty proved very useful when hunting processes. So the prompt sometimes consumes the whole line (in some deep subdir trees), but the information it provides is worth that space, and it’s more convenient and safe than checking “pwd” or “hostname” each time, which you can forget (especially when tired, which is the worst time to run dangerous commands) :)

“Mount a partition and have a nosey prior to hosing it (sometimes it won’t work coz it’s never had anything in it, but still: if the mount unexpectedly works …)”

Good point, which saved my skin a few times… “I’ll format that for the new backup partition. Well, I’ll just check if there’s anything on it and… wait, isn’t that my Windows partition? That’s not possible, because /dev/sdb is supposed to … oh, I plugged in that thing and… yikes.”

I once untarred an archive in the root dir. This was long ago, on HP-UX 8. It changed the permissions of the root (/) to 700. Since I was logged on as root, I saw no change in behavior. However, everyone else began seeing very strange behavior, and no one else could log on at all (it would seemingly log them on, then drop them). It took about 20 minutes of looking, before finding the problem.

I did the killall thing once long ago on Solaris, too. It taught me to switch to using pkill everywhere (that had it).

Can’t tell you how many times I installed Solaris in the 90s with a fresh install and then forgot to enable NIS, and had to drive back to the data center afterwards, because you couldn’t log in as root via telnet.

You are currently in your system’s root directory (/). You want to delete all files in /tmp, but type “rm -rf /tmp/ *” (note the space between / and *). After the command completes successfully, you try to change to another directory. But you can’t. There is nothing! It hasn’t happened to me yet… but hey, life can be long. :D

Another mistake which is possible: You are on a remote machine (via SSH) and think you are on your local machine. You want to re-partition your hard disk…

rm -rf / is _NOT_ as devastating as many here think it is (I would add that as a unix certification exam question). Let me re-post here what I already posted:

“… One thing I couldn’t buy as ultimately devastating though: rm -rf / – if I ever manage to do it as root on my *nix box I expect /bin, /boot, and part of /dev gone (and whatever else could be in / alphabetically before /dev on that box). Then the device hosting root filesystem will be deleted, and this will be end of my trouble. The rest: /home, /lib, /lib64, /sbin, /tmp, /usr, /var will stay intact. Other opinions?

Of course, you lose the system on the fly, but you can mount the partitions of the drive on another system, and you will see the other filesystems mentioned above intact.

The only problem with the above logic is what happens when someone lets the command run to completion. (I have seen exactly that scenario happen, when a young sysadmin failed to realise in time what he had done.)

Anyway, my list: running commands on the wrong box (done); locking myself out of a box with a firewall rule (done); ifdown eth0 when logged in over that NIC via ssh (done); chown apache:apache -R /var/www/html/* (done); dd if=/dev/zero of=/dev/sdb (when it should have been /dev/sda).

I am sure there are many more that I cannot remember off the top of my head. I remember these because they were the most painful. :(

> Locking myself out of a box with a firewall rule (done)
> ifdown eth0 when logged in on this NIC via ssh (done)

These are my favorite ones on very remote systems. ;-)

1. Set a reload of the firewall in the crontab 25 mins in the future. 2. Set a reboot in the crontab 30 mins in the future. 3. Set an alarm 23 mins in the future on your desk. 4. Make changes only on the command line first, and only when they really, really work, modify the config file.
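The schedule-a-rollback idea above can be sketched with a plain file standing in for the firewall state (on a real box you would pair iptables-save with iptables-restore, and use cron or at rather than a background sleep):

```shell
# Arm an automatic rollback *before* touching anything risky.
cfg=$(mktemp)                       # stand-in for the firewall config
echo "ACCEPT established" > "$cfg"
cp "$cfg" "$cfg.bak"

# Time bomb: restore the saved state in a few seconds unless cancelled.
( sleep 3 && cp "$cfg.bak" "$cfg" ) &
bomb=$!

echo "DROP everything" > "$cfg"     # the risky change

# Still able to type? The change is good, so defuse the bomb:
#   kill "$bomb"
# Locked out? Do nothing; the rollback fires on its own:
wait "$bomb"
cat "$cfg"                          # back to: ACCEPT established
```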

I was doing some database work. I rarely use a Windows machine, but that night I was on Windows, logged in via PuTTY. I was searching my shell history and found /etc/init.d/mysql stop. I selected that command and kept searching the history, not even sure what I was looking for, when I suddenly pressed the right mouse button. In PuTTY that pastes the selection, executing it, and in no time the database was down. That was the production database…

Re: Imagine the efforts to delete, rename, move or something that file!

Like this ?

$ touch \*
$ rm -i \*
rm: remove regular empty file `*'? y

As I recall, there was always a way to remove (etc.) a file by that name. There are also similar tricks if a file named - exists; e.g., most programs (GNU versions at least, cannot check others nowadays) allow -- to signify the end of options.
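The "--" end-of-options trick, demonstrated on a scratch file literally named -rf:

```shell
d=$(mktemp -d); cd "$d"
touch -- -rf            # a file whose name looks like an option

rm -rf 2>/dev/null || true   # parsed as options, so the file survives
ls

rm -- -rf               # "--" says: everything after this is a filename
# (rm ./-rf works just as well)
ls
```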

Regarding funky characters, sometimes you can use a ? in place of the problematic character, or use some other substring that will match. For instance, if you wanted to remove a file called “~*_blah”, you could run “rm ./??_blah”.

You should be very careful with ?, though. Reason: although it won’t match a ‘.’ at the beginning of a name, it would match the dot in file.ext if you specified file?ext, or a second dot.

That may not seem like a problem, but imagine this:

touch \.\? (creates a file called .?)

rm -f ?? (won’t delete that file, but would delete files [that you have permission to delete] with two-character names; and if you specify -r it would do the same for directories too, minus ..).

Sure, you can do ./.? but… hope you don’t do rm -rf without escaping the ?

Contrived or not, it’s best to escape characters or make absolutely sure it won’t do anything else other than you intend (which you even mention in a different post). Okay, so if you have a letter after, fine, but then you might be tempted to specify .?* and there’s another issue there. So yes, it can be done, but it should be pointed out it is risky.

Then there was the time that I wanted to use the tcsh shell for root login, instead of the standard bash shell. So I carefully edited /etc/passwd and changed the relevant part to “/bin/ctsh” instead of “/bin/tcsh”. Guess what happened next time I wanted to log in as root …

My suggestion: *NEVER* close your current open shell before testing the new configuration in another instance. That also helps while configuring the SSH daemon. ;-)

Another mistake I made in the past: changing the SSH port without updating the firewall rules. :D I ended up with a truly unreachable system. I was happy about the possibility to use a VNC remote console. ;-)

Something I’ve done in the past as a safety net for firewall edits (which isn’t foolproof, but worked for me) is to open a screen session on the server whose firewall you’re modifying, then do “ssh -R 29922:localhost:22 some.other.server”; then, if the firewall prevents inbound connections, you can still connect back through that pre-existing remote tunnel.

Due to a failed copy and paste that switched my commands around, I set /bin/chmod to not be executable. It took some outside-the-box thinking, but I used perl to reset it. I now use puppet to handle my configs: I use git to clone my puppet configs to a local user environment, test the changes, then use git to push them back to puppet and update configs that way. Definitely a smarter way to go.

In other words: Nice try, but your attempt at being ‘clever’ actually is the opposite. Maybe you should try testing your cleverness first next time ? The words that come to mind of what it makes you look like otherwise, is.. well, I won’t even go there.

If you don’t have rm aliased, it might be an issue (but again, see the first point). However, plenty of people have, say, rm -i as the alias for rm (which, though not the best choice, is still a default on, say, Red Hat systems and potentially others). In any case, it’s pretty silly to think it’s going to cause an issue (oh, and as the person above you said: backup).

So, yeah, it’s not an issue to delete files with special-meaning characters. It’s also not the 70s, when more was possible. I mean, even these days you have less permission (as non-root) to write to consoles and so on (say, echo or cat a file > /dev/pts/ …).

And interestingly, I just tried unaliasing rm, and it was still smart enough to process it as the option -r rather than play around with the file. Really, though, I shouldn’t be surprised, given how programs process arguments passed on the command line. It makes perfect sense given it starts with the - character.

One little program that will save a lot of time: molly-guard. I found it via aptitude; it asks you for the hostname of the machine you want to shutdown/reboot. This should adequately prevent accidental shutdowns.

Sometimes the filename is so ugly that you can’t even type it. In these cases you can address the file by its i-node: 1. Catch the i-node number with ls -i. 2. find . -inum NUMBER -exec rm {} \; This trick is described in “Unix Power Tools” from O’Reilly.

Yes, and that book happens to be quite good; I have an old edition of it and it’s still good. You could of course use the ‘-delete’ option of find to make it ‘easier’, but observe the man page warning on it first (do man find and then type /-delete to search for it).
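The inode trick end to end, in a scratch directory (the awkward file name is just an example):

```shell
d=$(mktemp -d); cd "$d"
touch -- '-weird name?'                  # a name that is painful to type

ls -i                                    # note the inode number
inum=$(ls -i -- '-weird name?' | awk '{print $1}')

find . -inum "$inum" -exec rm -- {} \;   # delete by inode, not by name
ls -A
```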

Another one for mixing up terminal windows here. I had two terminal tabs open in OS X: one local, one SSH’d onto the server. The local MySQL had some changes I wanted to put onto the server, and I wanted the local to match the server more closely (apart from the changes).

I was about to copy the server’s httpd.conf over to local, so I went to delete the local copy beforehand. cd’d to the right folder and went to delete it…. but between the two I’d swapped to a different window and then back, losing track of which terminal tab I was in. Deleted the httpd.conf file on the server and had to shut the box down while I rooted out the backup… then had to port all my changes over again.

My worst mistake is typing > instead of >>. I use >> way too often for my own good, on files I really shouldn’t use it on. Bash completion is the second worst, when I type rm startOfFilenam[tab] [enter without looking].

Someone had somehow created a file with zero length and the name “\”, without the double-quotes of course. Well, that just had to go.

On about the fourth or fifth try I noticed it appeared to be working but taking way too long to delete one file. Ctrl-c. Real fast. And of course I was root.

Was I ever lucky. I wiped the OS but stopped it before any user data was lost. The lead sys admin moved that HD to slot 1, put a new HD into slot 0, and handed me the OS CDs. I spent the rest of the evening re-installing the OS.

I’ve done some of these… ‘rm -rf fileprefix *’ late at night when trying to free disk space (instead of ‘fileprefix*’) removed all our corporate binaries; fortunately, backups had just completed, and the restore went quickly.

Here’s a trick I depend on every day: All my ssh/xterm windows are color-coded – production server windows are loud and glaringly obvious, no matter what I am doing. Makes me nervous just typing in them…

Ouch! I’ve had some SQL typos like that, too. Lesson I learned is to always BEGIN TRANSACTION; before any data-modifying statements. If I have several steps to do, I’ll do something like this in Notepad or vim:

BEGIN;
DELETE FROM table_name WHERE id = 1000;
ROLLBACK;
-- COMMIT;

I try this a few times until I get it right. Once I do, I remove the ROLLBACK and run it one more time.

My worst mistake ever was when I wanted to remove all DOT-directories (.esd-1000 etc.) in my /tmp-directory.

cd /tmp
rm -rf .*

As this took more than the expected very short time, I realized that .* also expanded to .., so I had effectively done an “rm -rf /tmp/..”, which is equivalent to “rm -rf /”… As I didn’t know which files were gone, I reinstalled my system.
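What the glob actually expands to is easy to check safely with echo, and a pattern like .[!.]* avoids . and .. entirely (caveats: it also misses names starting with two dots, and newer bash versions skip . and .. in globs anyway, though relying on that is unwise):

```shell
d=$(mktemp -d); cd "$d"
touch .hidden1 .hidden2 visible

echo .*          # in traditional shells this includes . and .. -- the landmine
echo .[!.]*      # dotfiles only: .hidden1 .hidden2
```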

Way back, I was trying to debug a ‘.cshrc’ which was not working correctly and wanted to see where it was hanging. I put ‘#!/bin/csh -x’ at the head of the file and started a new session. Unremarkably, I was left staring at a blank screen with no output, but succeeded in exhausting all process slots, preventing anyone else from logging in and slowing the machine to a crawl. Luckily I had another window open and was able to remove the ‘#!/bin/csh -x’ and kill all my processes, although it took about 30 minutes. Never done it since !!

Not really quite a command line screwup, but some of you might find it humorous. It does involve volcopy, at least (remember that one?)

I used to manage a room full of 11/780’s and 750’s. They all used those old washing-machine-sized RP05 drives; I think they held something like 100 MB. You set the bus ID with a big plastic plug with a number on it, like a kid’s building block, that you stuck into a hole on the front. They used a removable platter stack: you could open the top of the drive like, well, a washing machine, reach in and unscrew the platters, and the whole thing came out. I took advantage of this to do full backups. Every Sunday night I’d come in, swap a spare set of platters into one of the drives, volcopy another drive to it, then just switch the ID blocks. The new copy became the old /usr or whatever, the old platters went on a shelf as a known good backup, and the now vacant drive was ready to run the next backup. Lots faster than the 9-track tape drives I would have had to use otherwise, and I could run more than one at once.

Except one time I ran the volcopy before I remembered to switch ID blocks. All volcopy cared about were “volumes” – disc partitions – and the garbage scratch disc I’d just told it to back up to my /usr drive had perfectly valid partitions on it…. Whoops.

I’ve made a few (along with most of those above), and had some done by my predecessors…

Overlapping partitions… the first one fills up, overwrites the start of the second one, which was when we found out that his backups didn’t work either.

In the days before vipw, editing the password file on a full partition.

Believing the old 2 volume System V ringbinder which told me to format, not mkfs a partition. At least my backups worked that time. Steep learning curve, that.

A colleague realised that he could replace /etc/passwd from his PC (DEC Pathworks, anyone???). Unfortunately a DOS-formatted /etc/passwd is as much use as no /etc/passwd.

Re: the tar/untar problems with resetting the CWD permissions. Now you know the reason not to create an archive as tar cfz arch.tgz . – always add a directory above, or use tar cfz arch.tgz * .[a-zA-Z0-9]* – obviously only use the second pattern if there are hidden files to transfer.
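The “add a directory above” advice can be followed without a manual cd by using tar’s -C option (present in both GNU and BSD tar): archive the directory by name, so every member is prefixed with it and extraction can never touch the current directory’s own metadata. The paths below are scratch names for the demo:

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/demo/project"
echo data > "$workdir/demo/project/file.txt"
# -C changes into 'demo' first, then archives 'project' by name:
tar czf "$workdir/project.tgz" -C "$workdir/demo" project
tar tzf "$workdir/project.tgz"   # members read 'project/...', never './...'
```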

A long long time ago when I was learning my ropes as a DEC VAX / Ultrix admin, I hit the BREAK key on the VT220 terminal expecting to return to the LAT server prompt in order to open a new connection to another VAX (all serial terminals connected to a terminal server back in the days, must have been the late 80s or thereabouts).

However, I did not know that when you hit BREAK on the terminal connected directly to the server’s serial port, it, well, … breaks. ;-P

The machine was our main faculty student server and had about 30 people logged in at the time, and as the classroom was right next to the admin room my boss and I instantly heard angry shouts and pitchforks being sharpened next door. He probably saved my life (or at the very least my reproductive capabilities ;-) by intoning in a rather loud voice: “Damn those pesky students! They’ve crashed math6 again!!!” ;-)

I’m just a young guy, however I do enjoy reading everyone’s war stories. Best to learn from listening to the people who have been there. I got a boneheaded one though. While I was updating my system, I forgot to update libc first. Luckily I had a boot disk handy, and was able to roll back the mistake. Cost me about 2 hours of a Sunday.

1) “last | reboot” instead of “last reboot” … 2) Ran ldd /bin/ls, then as root copied the following output line from the command and executed it: libc.so.6 => /lib64/libc.so.6. This empties /lib64/libc.so.6 ….
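Why that second one destroys the library: the shell splits the pasted `libc.so.6 => /lib64/libc.so.6` into a command name (libc.so.6), a stray `=` argument, and an output redirection `> /lib64/libc.so.6` – and redirections are performed (truncating the target) before the shell discovers the command doesn’t exist. A harmless reproduction against a scratch file:

```shell
scratch=$(mktemp -d)
echo "pretend this is libc" > "$scratch/libc.so.6"
# Simulate the paste; the 'command not found' error is silenced:
libc.so.6 => "$scratch/libc.so.6" 2>/dev/null || true
wc -c < "$scratch/libc.so.6"   # 0 -- truncated by the redirection alone
```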

You’d have no employees then and you’d be on your own in the end. You’d also eventually (if not already) need to fire yourself.

And as for “noob mistakes” – either you’re lying, you’re unaware of mistakes you made or you simply are so inexperienced that the commands you type are incredibly basic. Maybe even a combination of the above.

10 years isn’t all that impressive anyway.

And while I try not to criticize based on grammar (no one is perfect and this is more like a discussion forum than a professional document), I’ll make the exception (and I admit I make mistakes too). You might (although anyone with experience knows that to be a complete lie) make no mistakes at the command prompt but you sure make a lot of mistakes in your comment as far as language is concerned. I’ll refrain from pointing them out for the reasons I already mentioned (and I’m not going to stoop so low).

I will however point out something else: those who are in denial about such things will never learn from mistakes and will never grow. In other words, they’re incapable of improving and should be asking themselves “why do I bother?” rather than insulting others who are more mature, honest and willing to grow. Truthfully, people with the attitude you have are afraid of admitting to mistakes and often the severity and amount of those mistakes. Does it really make you feel better? I doubt that very much.

Hint: humans learn from mistakes. So either you’re not human or you don’t learn often.

No one is perfect and that includes me. I’m far from perfect. But to insult someone for being human (and therefore prone to mistakes) is quite arrogant and hypocritical. I have made quite some bad mistakes (though most not command line but in source code). Bad mistakes also means basic mistakes and something I had never made even when starting (when you’d expect it to happen). But it happens. I remember being quite tired at the time but that was about it. Later on the problem was discovered and I spent a great deal of effort to track it down and did. What matters is how you respond to the problems you may run into or cause. It’s about responsibility, and always improving yourself (and if you can, others). He or she is doing the opposite.

woot reminds me of a bully. Nothing else. And anyone who has either experienced bullying first hand (as victim) or just has some insight to issues in the real world would most certainly know what that’s a sign of: very low self worth and esteem, embarrassed to admit to their own flaws and simply put people down to try to compensate for their own issues. It isn’t even authentic and they know it and everyone else knows it. The bully just is in denial. They may have issues that bother them but they are afraid to get help or whatever else. But in the end, it hurts everyone involved.

Anyway – I appreciate the kind words. :) Speaking of such – I love your little script up there!

Last big blooper was a simple restore of a test database — on the production server. I learned quickly how nice MySQL recovery works!

Anyone who can, color-code your PS1 strings for different servers, or at least make test one color and prod another. It helps to be able to just glance at the screen and have the prompt color tell (yell?) you what box you are on. But then again – humanity reigns – see above.
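A sketch of what that looks like in ~/.bashrc; the prod-* hostname pattern is hypothetical, so adjust it to your own naming scheme:

```shell
# Red background on production hosts, green elsewhere; \[ \] keep the
# non-printing escape codes from confusing readline's line-length math.
case "$(hostname)" in
  prod-*) PS1='\[\e[1;41m\]\u@\h:\w\$\[\e[0m\] ' ;;
  *)      PS1='\[\e[1;32m\]\u@\h:\w\$\[\e[0m\] ' ;;
esac
```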

Being seen as the ‘Unix Expert’ meant anytime an admin wanted help they would call us and immediately log us in as root. I wanted to see if a certain command (can’t remember which now) was installed in /bin, so without looking at the screen I typed: cd /bin; ls | grep thecommand. Unfortunately the keyboard was not mapped properly and the “|” came out as a “>”, overwriting /bin/grep.

You guys should really stop doing normal stuff as root all of the time. Force yourself to sudo, pfexec, su – whatever suits your fancy – *any* time you want to do anything that you think might require root privileges. It’s probably the single biggest piece of advice I can give to a youngling administrator.

A dangerous one is this: rm -rf /whatever. The problem is that maybe you’ll hit enter right after the / (maybe someone pushes your chair, or whatever). Obviously your system is dead after that. My solution is to type rm /whatever first, then use the left arrow key and insert the missing -rf part.

For everyone who has done the ‘rm -rf *’ thing … there is a nice trick you can add to all the servers you administer ……

One time only, issue the command touch /-i

This creates a file called ‘-i’ and when the * in an rm -rf * mistake is expanded, it includes the ‘-i’ file name … that gets passed to the ‘rm’ command which treats it as a command line option – and prompts for confirmation of the action.
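Here is the trick in action in a scratch directory (in a non-root directory you need touch ./-i or touch -- -i, since a bare -i would be parsed as an option to touch). Note it only protects globs run from that directory, and it relies on GNU rm letting a later -i override an earlier -f:

```shell
d=$(mktemp -d)
cd "$d"
touch ./-i file1 file2
# A careless 'rm -rf *' expands to: rm -rf -i file1 file2
# The trailing -i wins, rm turns interactive, and 'n' declines:
echo n | rm -rf * 2>/dev/null || true
ls        # file1 and file2 are still there
```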

It’s not related to Unix, but I was taking a backup of a .cbs file from the cbex tool (used in Clarify 11.5) in a production environment. To take a backup you have to click the Export button, but by mistake I clicked the Export/Purge button; it took the backup but also deleted those files from production. I told my local delivery team about it and then quickly put the backup files into production again.

As a new Linux user it would be a big help if commenters could post some links to “safety scripts” and aliases. A script that intercepted a command like “rm -rf /*” and asked if I REALLY wanted to delete my entire filesystem would be nice.

I am learning to write scripts but shortcuts are always appreciated.

These kinds of scripts might also give some ideas to kernel programmers.
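A hedged sketch of such a guard rail as a shell function (a kernel can’t reasonably do this, but a shell can): it refuses the bare / target and passes everything else through to the real rm. It is illustrative, not bulletproof – note the shell expands an unquoted /* into /bin /etc … before the function ever sees it, so a full solution needs more checks than this:

```shell
rm() {
  for arg in "$@"; do
    case "$arg" in
      / | '/*' )   # bare root, or a literal quoted "/*"
        echo "rm: refusing suspicious target: $arg" >&2
        return 1 ;;
    esac
  done
  command rm "$@"   # 'command' bypasses this function, calling the real rm
}
```

For what it’s worth, GNU rm already ships --preserve-root (the default in current coreutils), which blocks rm -rf / at the tool level; the function above is only a sketch of the “intercept” idea asked about.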

I think this is so true, I don’t want to cut my throat (Linux as straight razor) – tinyurl (dot) com/c9vwy2w

Be careful with that kind of thing. Okay so you put in .bashrc the following :

alias rm=’rm -i’

which then makes rm (once $HOME/.bashrc is sourced/read) interactive. Now what happens if you add a new user, or forget that you’re a user on a new system where that alias isn’t set up? You assume that rm will protect you, but instead you accidentally remove however many files. By all means, get in the habit of adding the -i option every time you type rm, but be very careful with what you put in aliases to save time and effort (remember: yes, you save some typing, but does that matter when you have to fetch backups – or worse, have no backups – of important files, system files or not?). Relying on and trusting everything so easily is a dangerous game to play.

One of our CoolThreads servers (running 4 guest LDOMs) got rebooted on its own, and when I was checking the server I mistakenly issued the command “last | reboot” instead of “last reboot”, and the server started rebooting again. I would have blamed faulty hardware or something else for the original reboot. On top of this, I executed the same command on another running Solaris box (I thought I was logged on to a test Solaris server) and that server started rebooting too. That is when I realized that the command I had executed was wrong.

Very wrong even. You sent the output of last to reboot, or put another way, you piped last into reboot. Of course ‘last reboot’ does indeed work, as would ‘last | grep reboot’. I would suggest you learn more about the pipe because it is incredibly powerful. A nice article explaining this and other features of the shell can be found (if you have info installed) by running info coreutils, then typing /toolbox, then hitting enter twice and reading the information.
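The distinction can be shown with harmless stand-ins: an argument is data the command inspects, while a command on the right side of a pipe runs no matter what arrives on its stdin (reboot simply ignores its stdin and reboots):

```shell
echo reboot          # 'reboot' is just an argument: the word is printed
echo reboot | cat    # cat runs and copies its stdin: same word printed
echo anything | true # true runs regardless, discarding its stdin entirely
```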

Of course, even if you already know this stuff it is easy enough to do and that you admit it is half the battle.

I would argue that it is more like “problem” – as in it is not a real problem. It is not unpredictable in the slightest. If you did the same thing under the same conditions it would do the same thing, yes? Yes, it would. So how is that unpredictable? Same command, same conditions, same result. It isn’t unpredictable – it is exactly the opposite. Can you really expect it to do more? If you really need the safety net, use -i. (Yes, the Unix Hater’s Handbook is old, but some of their stuff is still ridiculous for that time frame. Much credit to them, though, for allowing some co-creators of UNIX to write parts of it – I believe it was Ken Thompson who got the better of the authors.)

Blaming a system for your own mistake is really just making an excuse for not being perfect (which isn’t bad – you learn, you expand, you make mistakes – we all make mistakes), and that is the way to not really learn (you learn through mistakes if you actually admit to them). No one is perfect, and placing the blame elsewhere shows who is at fault more than if you were to laugh it off, make a mental note (and I don’t necessarily mean YOU personally – I mean people in general) and learn from it. And yes, that goes for everyone, myself included. I’ve more than once screwed myself over by doing [something] while sleep deprived. However, I took the blame (as in I blamed myself, as it was my fault) and I fixed the problems from backup (in one example it was an overwritten file).

Although they are not the same thing, file globbing and regular expressions have rules and logic to them. That there are only two files (other than the normal ‘.’ and ‘..’ – which, be thankful, * doesn’t match) is not the fault of the shell, and as a user you make a choice: functionality or babysitting. Anyone who is using mv (or cp or anything like that) on * without first checking (or knowing) which files it will act on is asking for trouble, and it isn’t the system’s fault. The same could happen on other systems. Another choice of the user: whether you back up or not (and trusting an administrator to do all backups is not always the best plan, although sometimes you don’t have a choice, depending on the environment). Unless it was a very new file, there’s really no excuse for it not to be in a backup (but that’s only one mistake).

(And I hope this did not sound like an attack or anything. I’m just making some points that everyone could stand to think about from time to time, because the world isn’t perfect, but accepting that and improving where you can is so much better than blaming something or someone else. If it did feel like an attack, I didn’t mean it to be, and I’m sorry in advance.)

I once set up a Windows-Box via VirtualBox on my Server so I could use Visual Studio on my netbook. When the VM was running, I’d tried to connect to it via remote desktop. I installed 3 remote desktop applications on my server until I realized that I was using my server’s shell …

Maybe if they added it as default to the others (rm aside), but definitely not if you make it an alias and you ever expect to manage more than one system (even for a moment). Of course, any administrator who is doing such a thing to begin with (as in recursively running any command on / which ends up in a disaster, or actually any command in any circumstance that is risky) would do well to learn of such things, and frankly an administrator running a command they are not too familiar with is asking for trouble (“oops, I forgot I was root…” Yes, and if you just used sudo, or logged out once you were done with root’s task, you would not have that problem of forgetting. Besides: ever hear of whoami?).

But regardless, there’s a fine line between babysitting (which I would argue that adding those options as default is exactly that) and being helpful (the opposite of babysitting). Also, by making it “easier” or adding more warnings you are in fact making people think less (a dangerous thing), and so what happens when they come across a different system without this setup (or alias)? All it takes is one bad movement with, e.g., a fat hand or finger (or not – just being careless is enough!), or even (at home) a cat jumping on the keyboard at the wrong time. Or, like I did in the days of ctrlaltdel being in /etc/inittab: having two keyboards around attached to two different systems with only one monitor, and hitting ctrl-alt-del on the wrong keyboard… stupid as it may be, it happens.

But at that point the mistake is made, they have no idea WHAT they did or HOW it happened, and they don’t even know what to do about it (they might not even know the problem exists until later, depending on what the mistake was). One can argue that they should be protected from mistakes, but the truth is no one is perfect, and you can learn from your mistakes if you are responsible about them (or responsible in general).

This isn’t Windows, though, and whether some think it’s too risky, too difficult or anything in between is their problem. One must learn not to (as I wrote somewhere else in this thread) be so trusting, and actually be sure you know what the command you are typing is doing, and also be aware of your environment (I mean that both in the sense of permissions/access and in the physical/spatial sense).

I had never used Unix, but I was hired on contract to head up a small programming team for an insurance company. They had a Sun Solaris system that was totally overloaded, so my first task was to purchase and install a new system. This was in the early 1990s, so Unix had not yet acquired many of its most helpful commands and options.

I was working 20+ hours a day trying to understand that strange nest of daemons called Unix, configure this new hardware so I could ethernet it to the old (for work space) and install and configure SAS. A little after 3AM about 4 days after I had started I had everything done and tested. Now, all that remained was to get rid of all the work and trash files so I could have clean file systems, etc.

So, from the root I did the only easy thing “rm *”.

I spent the next two days restoring everything, after reformatting the box.

Hi, I wanted to introduce myself before telling you about a mistake that I should state is hilarious now, though it was not funny when it happened. I’m a totally blind computer user, using a screen reader to navigate the web, among other things. All IT-related things have been learned through experience, no formal college education. I’m actually now the person in my family who gets asked for any technical advice. I’m regarded as the geek, and enjoy working on boxes over SSH. Since I don’t actually have a job, I spend a lot of time doing a lot of things: working on remote Linux boxes, and other geeky things.

Now, for the not-so-hilarious mistake. If you think blind folks can’t be targets of social engineering, I hope you’ll reconsider! In this case, the computer ran both Windows XP (as my main production system) and Gentoo Linux, as a dual-boot. I’d somehow screwed up Windows, preventing it from booting. So I started up Gentoo, logged onto IRC, joined the channel where a lot of my blind geeky friends were, and asked for help. Please note that at the time, my NTFS partition was mounted read/write. Well, one of my blind IRC friends said something like: “Sure, I’ll help. Give me a user account with permission to get full admin access.” Then he submitted a user ID and password, and instructed me to add him to the wheel group. I’d installed the superadduser package (I think it was) that gave me a wizard interface of some type. So I added Mo (his nickname on IRC) to my system, under the user account he’d asked me to set up. As you can imagine, I hope, I added him to the wheel group. I had no idea what wheel would let him do, and when he did this I had to go take a bath, so I was out of the room. I came back to my Linux box to find that it was no longer speaking at all. I thought it had locked up. So I powered the machine off (as it was not accessible to me at all at this point, from a blind user’s perspective – I’ll explain more in a second) and tried to get back into Windows. The result was that I had to reinstall Windows 98 with sighted assistance from my brother, and then upgrade again to Windows XP. I didn’t try the complex process of a Gentoo reinstall, as I wasn’t the one who did it back in 2005; a friend of mine actually did it over SSH for me at the time. What I later learned was that Mo had executed a command that I had never heard of at the time, the command being: rm -rf / PLEASE DO NOT GO TYPING THAT IF YOU VALUE YOUR SYSTEM!
I now know what that did: it basically wiped everything (including my NTFS Windows volume), and also wiped the software speech components my Linux box used to speak through the sound card, via the Speakup Linux screen reader for the text console. That is why I thought the system had “locked”. Also please remember that without Braille (a Braille display is very expensive, $2000 plus, and not something I have available at the moment), speech is the next best thing, and without either of those (especially since a monitor is pointless to me), you might well think a box had locked up. How’s that for a mistake? And not one I intended to have happen. I got used, though, and Mo did that on purpose, knowing full well what it would do. That was his form of “help”.

Well, Keith, your mistake is not quite a command line mistake, but a naive (no offense intended) guy-in-need-of-help mistake…

I would never trust a guy from an IRC channel to help.

I’m not blind. If I needed help and let an unknown person onto my computer, and then went away while he tampered with my system, I sure would make a mistake as well ;-)

The moral of the story is: never trust anyone but close, experienced friends/relatives or professionals when you need help :) Or google it and do it yourself, which might, in that case, be a bit harder while being blind. Or at least, it might take more time than if you had sight, I guess :)

I’m curious: how old are you and are you using Linux again now and then ? Your story must be quite old as you speak about Windows 98 and upgrading to XP.

As Superkikim points out, your mistake was that you were too trusting. Trust was given all too easily then, and still is to this day. Anyone (this is a rhetorical question) remember hosts.equiv and its implications? To this day it is hard to imagine that it ever was a problem, but oh, was it a serious problem. Admins would argue “I don’t care if the attacker has root on their own machine as long as it is not mine” while at the same time allowing the r* commands from ANY host as long as you had the same login NAME. Well, what do you think root can do? Exactly: create a user, so they basically have control of at least one account, which means they are one step closer to root access. And even when it was host-restricted you have to be realistic: although IP spoofing is a blind attack, the thing to remember is that IP spoofing is only part of the exploit (trust-relationship exploitation). Sad (albeit also funny) is that remote mounts were abused, too. And as I recall (I might be remembering this part wrong, although the exploit I know did happen), hosts.equiv was also used to hide the presence of a login (meaning ‘who’, for example, would not show that someone was logged in from some host – completely hidden, so that they could work on the next step).
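For readers who never met it, here is roughly what the file looked like; the hostnames and user are hypothetical, and the wildcard line is the infamous worst case:

```
# /etc/hosts.equiv -- historical r-command trust file (illustrative only)
# A bare hostname trusts any same-named user connecting from that host:
trusted-host.example.com
# Optionally a specific user on a specific host:
trusted-host.example.com alice
# The notorious wildcard: trust ANY user from ANY host -- never do this:
+ +
```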

As for what I think about your comment on social engineering: I don’t believe for a second that that was social engineering. No, he didn’t have to do anything at all other than say he would help and ask for what he supposedly needed in order to “help”. That’s akin to a coward mugging an elderly person by saying “You look as if you need help getting that stuff into your vehicle…”, the elderly person agreeing, and then being robbed. While some might argue that this is semantics, the truth of the matter is he didn’t do any engineering other than saying he’d help (typically social engineering tricks the user into giving the attacker information, or something of value, based on who the attacker claims to be. Classic example: convincing people over the phone that they need to hand over a login/password so a problem can be fixed… yeah, what problem was it again? One the social engineer created from thin air in order to get into the system. That’s the only problem in that case, and now the attacker is ahead of the unsuspecting victim).

I’m 26. As for it being social engineering or not, it was at least in the sense that this is what he intended to do the entire time once an account was created. He wanted to see if I was stupid. Or so he said. Would I just happily not do Internet research, set up an account, and do what I was told to do? These days, no way. In 2005? Probably. Although I have made my own command line mistakes, like ticking off a blind friend who had helped me install Gentoo over SSH (I was on a cruddy connection back then, with a five second delay for terminal text to appear and be spoken by his speech system at the time). He had been monitoring me, and I had gone to run emerge --remove package-name, at least that’s what I think it was back then. As part of the output generated, it said “empty /usr/lib”, and I thought it wanted me to actually do it. So without even doing Internet research, man, or anything else, I happily ran rm -rf /usr/lib as root. Then my friend somehow sent a message to my terminal telling me that he demanded to see me on Skype, and he was pissed off at the time (since he actually had to reinstall the filesystem). I then guessed Linux, or any other Unix system, obviously couldn’t run well without /usr/lib. If that particular mistake isn’t a command line one, well then let’s find one that is.

I’d actually be curious to know if any of you have run rm -rf /usr/lib before. Another time I did something similar to what one of you did on this thread, putting a space between the / and what came after it in an rm -rf as root, thus hosing the system; I’ve even done that on a dedicated production server over SSH. The result was I had to open a support ticket requesting that the CentOS system be installed again. In my experience, once you execute a command like that, or the rm -rf /usr/lib example in a previous comment that I referred to, you might as well not bother with control-c.

No, I’ve not. But since many executables are dynamically linked, naturally removing libs will cause serious issues. That said: you can salvage data from mistakes, but whether or not depends on how long the command ran and which command it was. E.g., recursively using chown or chmod is a bad idea too if you are not 100% sure you know what you are doing (and that you are not going to hit parent directories, which, being recursive, means you are hosing your system). But I had a friend who accidentally ran chown -R ../ instead of what he meant to do (I cannot recall what it was exactly, and I’m too distracted to bother looking at the chat log about that one). He recovered from it, and it mostly just caused about an hour of extra work (because he was quick to notice and react).

As for rm and whether you should bother hitting ctrl-c (note my friend did exactly that: hit ctrl-c and had an extra hour of work, which wasn’t too bad, but if he had let it run its course then it would have been bad; and he had to be root for the task because chown requires that – it is a security risk for non-privileged users to be able to change a file’s owner), it really depends as well. First, consider whether the admin backs up (my friend does, but many do not). If they do not back up (shame on them, but that won’t change anything, will it?) then, depending on the command and where it started, they might be able to salvage their user data (or more) that they foolishly did not have backed up. It all depends. The bottom line is twofold (besides backing up, but since many don’t do that let’s forget about it): 1. You can salvage messed-up systems. Sure, sometimes it might be easier to start over, but that is not always the case, and it depends on WHO is doing it (read: their experience and expertise). 2. You need to be careful (and this is something that is also ignored and I just don’t get it) that you don’t just stay logged in as root “in case you might need it” – really, how hard is it to use sudo or su? An extra command, versus an hour, hours or days of frustration, potential lost data, even a potential job loss (as in the company fires you).

And no, it was not social engineering. You can argue that all you want, but then general pranks (no matter how mean or not) would be social engineering. You fell for a nasty prank, but he didn’t put any work into getting information out of you. He said he’d help and here’s what to do (what about people who do that and are sincere and actually do help?). The fact he was being nasty and disguised it as help doesn’t really mean he social-engineered you. Better said, he didn’t disguise himself as (for example) tech support or something like that. He didn’t go to you – you went to him. That last line is the main point. A classic example of a social engineer is Kevin Mitnick.

As for rm -rf on /usr/lib or its 64 counterpart, there is one other thing that can help you salvage it.

What is it? busybox, or if available, sash (stand-alone shell, I think it stands for). Here’s a brief example. The first command shows how it is able to work when libraries are gone (of course the entire tree being gone may be a lot harder to fix, but updates gone afoul can cause problems with certain libraries, so having the ability to fix them is very handy):

$ /usr/bin/ldd /sbin/busybox
not a dynamic executable

Now, while this is only one command it can do, observe (deliberately leaving out any parameters as I’m merely showing the use):

$ /sbin/busybox cp
BusyBox v1.15.1 (2013-11-23 12:50:41 UTC) multi-call binary

Usage: cp [OPTIONS] SOURCE DEST

So any command it has support for (which is quite a few) it can run even if a library is gone, corrupted or otherwise damaged. A friend who reminded me of this the other day – I knew of busybox but never really used it – made use of it when an update (via yum, most likely) corrupted his libc (which itself would have been very ugly indeed had he not had something like busybox). Also, on this note: yes, you should bother interrupting the command, because as you can see, you do have a chance to salvage it! I learned something that day, but knowing that friend it is not that surprising he was able to do so (I consider him a mentor in many respects, computers and life).
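In the same spirit, a sketch of what that rescue looks like; /root/rescue/libc.so.6 is a hypothetical backup path, the actual copy is left commented out, and the whole thing is guarded so it degrades gracefully where busybox isn’t installed:

```shell
# A statically linked busybox needs no shared libraries, so its applets
# keep working even when /lib64/libc.so.6 is damaged.
if command -v busybox >/dev/null 2>&1; then
  busybox echo "busybox applets available even with a broken libc"
  # busybox cp /root/rescue/libc.so.6 /lib64/libc.so.6   # the rescue step
else
  echo "busybox not installed -- worth having before you need it"
fi
```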

Upgraded Debian 2.0 production server to 2.2 skipping one major version in between…

The upgrade went fairly well, until some process crashed because of missing libraries or wrong versions or something. It is particularly nice to realize that if I reboot this machine I am totally fucked. I tried to download the missing libraries with wget; couldn’t, because of the missing libs. Almost every executable crashed because of the missing libs. Finally I managed to use lynx to retrieve a working version of libstdc.. and bit by bit I got it upgraded. Learned a lesson though…

Also, as a real novice I was asked by a user to install some software he wanted on FreeBSD. I didn’t know how FreeBSD worked, so I managed to reinstall the whole base system, lost all the user accounts and so on. I finalized my own grave by not realizing that FreeBSD was wise enough to keep copies of the passwd and group files.

I logged in to a server over SSH as root for one of my DC team for troubleshooting. After finishing the work I forgot to log off. Some time later I checked from my local machine over SSH whether the session was closed, and saw it was still alive. So I simply ran “pkill -KILL -u root”, and then realised what I had done. It took half an hour to get the system back on-line.

Not so much a fat-finger at the keyboard (though I have had plenty of those – including wiping out an entire copy of the production dataset that took 3 days to replicate… guess I’m waiting 3 more days!), but my most fun disaster occurred when de-racking a server.

Boss and I were decabling this box for removal. He’d disconnected everything, and I was in the front to pull the thing forward. Keep in mind, this was a 2U box in slots 5 & 6 (very near the bottom). He said ‘OK!’ and so I went to yank it out. It stopped, bounced, and wouldn’t come out further. He said ‘Oh, just a cable snag’, and out it came.

The iLO cable, which had not been disconnected, wrapped itself around the isolate switch on the rack UPS in slots 1, 2, 3, and 4. So now we have this server out, but something is beeping. “What do we have in here that beeps?” Took out four domain controllers, the head-end Exchange, and a few other things. Turned out that some knucklehead had plugged both PDUs into the UPS we’d just shut off.

I didn’t see this in the original article nor in any of the comments, so I thought I’d share this one. A lot of times we copy and paste commands into a terminal window. I’ve had cases where I attempted to copy something but the copy somehow failed, so I pasted other text instead. Usually this isn’t that big of a deal, but one person on my team pasted a bunch of logs into the terminal, and the logs had a crapload of >> symbols in them. The application we had running on that host got completely messed up and had to be reinstalled. It’s hard to say how to prevent this, except: if you have any doubts about what’s in your clipboard, paste it somewhere else first.
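A sketch of what those pasted >> symbols actually do: the shell runs each pasted log line as a command, and the redirection creates (or appends to) files before the shell even discovers the “command” does not exist. The log line below is made up, and everything happens in a scratch directory:

```shell
d=$(mktemp -d)
cd "$d"
# Pretend this line arrived from the clipboard instead of a command:
2021-01-01 12:00:00 INFO request >> handler ok 2>/dev/null || true
ls   # a file named 'handler' now exists, created by the stray '>>'
```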

Unless you’re an operating system! That’s a good thing though! Can you imagine how awful it would be to have an OS not able to do that? It’d be as bad as DOS only no TSRs either which is well, DOS is not very helpful even with TSRs (unless it’s for emulation of some old demo, some old game or some such). Indeed though, any interruption – be it taking a break or the bloody phone rings (possibly for the umpteenth time in the day due to bloody telemarketers) – can cause serious issues if not addressed properly. In general, yes, you should either check what’s in your ‘clip board’ or copy and paste right then and there. Of course, if you’re in vi or vim then notwithstanding copying into vim that is in a terminal session (under X, say) then you’ll need to yank (or alternatively delete which does the same but first removes the text) and then paste it then if you want to keep it (and be 100% sure of that). No idea about Escape Meta Alt Control Shift. Sorry, was that actually what I wrote? I meant emacs of course! What was I thinking? (Jokes aside, to any emacs users: I have no intent to offend let alone try to stop others from using what works for them. That said, I have to admit the Church of Emacs and its followup Cult of Vi is a very clever, very fun, very funny and a much better approach to X wars where X is OS, programming languages, text editors, shells, clients of certain services, CPUs and whatever else people might like to argue over).

Perhaps, perhaps not, depending on your definition. A properly programmed multi-threaded application does quite well; an improperly programmed one is another issue entirely. As for the OS itself, I can offer only this: it does a far better job of juggling tasks than a human ever could, that much is for sure! If you’re the curious kind, or you need further elaboration, try the experiments in the man page for top, specifically section “7. STUPID TRICKS Sampler” (the others are probably of interest too): “Many of these ‘tricks’ work best when you give top a scheduling boost. So plan on starting him with a nice value of -10, assuming you’ve got the authority. … 7a. Kernel Magic …” I’d paste it all, but it’s a bit long and I’ve already pasted enough. You’ll see from this how CPU-intensive and non-CPU-intensive tasks alike all get their time, and very quickly.

That aside: I’m not sure you can blame hardware bottlenecks on the OS, really, any more than you can the other way around. At least not rationally, especially for a system that is otherwise efficient.

Why? Simple: cp is a core utility, and without it, manipulating files (like, say, copying them) would be very difficult, to say the least (ignore dd, which I’ll mention anyway; besides, it defeats my point, which is the real point to consider). cmp is for comparisons, and similar to it is diff. As a programmer I use diff far more often than cmp (which is to say: very often versus very rarely, if ever). I would much rather know what differs than merely where the files differ, since I usually deal with source code of some form or another, and even with text files[1] the output of diff is much more useful once you know how to read it. And of course cp file1 file2 behaves the same with regard to cmp or otherwise: cp copies file1 to file2, and you could say the same of dd if=file1 of=file2. This isn’t considering hard links (or cp’s options related to links), and neither is it considering file permissions, since the main point is to copy the contents from one inode to another rather than produce two files with every single attribute identical (but yes, I know that’s beside the point).

The bottom line is this: the beauty of Unix (and therefore of Linux-based OSs) is the ‘do one thing well’ philosophy, combined with the pipe, allowing for a very flexible, very powerful system with many possibilities. What you noticed is not even the worst of it (I don’t see it as a problem at all, any more than diff and diff3 being ‘similar’). What is a problem, though an inevitable one, and again part of the beauty of an environment that allows different developers and different packages with similar but not necessarily equivalent functionality[2], is when two packages (e.g., in a binary-based distro) ship the same file name (path included), so that only one can be installed at a time. I’ve seen this happen even between quite different capabilities.

It just happens, and there is no real reason behind cp and cmp being similar in name (if that’s how you view it), other than that cp was created for one purpose, cmp for another, and both names are decent for their uses. [1] Let’s be honest: there is a reason the old error ‘Text file busy’ happens with ‘programs’: a simple text file can be a program. Even then, if you want to see the detail between two files, diff is more useful. [2] Examples: Apache mod_nss and mod_ssl, both offering TLS support for Apache. But this is only good: first, libraries are more likely to be named carefully; second, some might prefer one over the other. Another good example (though a utility/service rather than a library): sendmail versus postfix versus qmail.
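To make the division of labor concrete, here’s a minimal scratch-file sketch: cp duplicates contents, cmp answers “are these byte-identical?”, and diff shows what actually changed.

```shell
cd "$(mktemp -d)"
printf 'alpha\nbeta\n'  > file1
printf 'alpha\nbravo\n' > file2

cp file1 copy1                               # cp: copy contents to a new file
cmp -s file1 copy1 && echo "byte-identical"  # cmp: quiet yes/no comparison
diff file1 file2 || true                     # diff: prints the changed lines;
                                             # a nonzero exit just means "differ"
```

cmp’s -s flag suppresses output and answers only through the exit status, which is exactly what scripts want; diff’s line-oriented output is what humans (and patch) want.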

Oh, okay, I cannot resist: there are cases where cp file1 file2 will not result in what you expect. One cause (actually two): disk space and quotas. I imagine there are others too, not counting cp being aliased to something like ‘cp -n’ when file2 already exists, or ‘cp -u’ when file2 is newer than file1; those are ridiculous, and I only mention them because I’m really bored. (-n is no-clobber and -u is of course update, and yes, when I wrote ‘really bored’ I meant I was deliberately coming up with ways, semantic or not, in which it could be incorrect. I already answered the real question, so that’s my excuse.)
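For the record, those two options behave like this (a scratch-directory sketch; note that newer GNU cp releases also report the -n skip as an error, hence the || true):

```shell
cd "$(mktemp -d)"
echo old > dst
touch -d '2000-01-01' dst          # make dst deliberately ancient
echo new > src

cp -n src dst 2>/dev/null || true  # -n (no-clobber): dst exists, so skip
cat dst                            # still "old"

cp -u src dst                      # -u (update): src is newer, so copy
cat dst                            # now "new"
```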

And to be fair to you: yes, similar commands can cause problems for those who make a lot of typos, for example (typos being not such a good idea at the command prompt; everywhere else, well, fine). You can also argue this for > and >> (that is, overwrite versus append to a file; and yes, as I’ve written here before, both are valid and both are very useful, even though some people are too afraid to learn to make full use of them, which ultimately means they will never be as proficient as they could be).
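The difference in one scratch-file sketch:

```shell
cd "$(mktemp -d)"
echo "first"  >  log   # '>' truncates (or creates) the file
echo "second" >> log   # '>>' appends to it
wc -l < log            # 2 lines so far
echo "third"  >  log   # '>' again: the first two lines are gone
cat log                # only "third" remains
```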

I’m not sure I ever submitted any mistakes… but here is one I made recently. It isn’t exactly a command-prompt mistake, but it is a BIND configuration issue, so it’s close enough (a system administration mistake).

I got reverse delegation of my IP block and was setting up named.conf to refer to the .in-addr.arpa zone. While the zone file itself was fine on the first go, I unfortunately made a typo in named.conf: I ended the zone name with -in-addr.arpa (a ‘-’ instead of a ‘.’). That meant the zone was named one thing (the incorrect -in-addr.arpa form, thanks to the typo) while the zone file was written for another (the correctly named zone). This led to two or three days of wondering why on earth I was getting out-of-zone errors (and other errors) when doing rndc reload, and on queries to my server; and why, when I made the same changes on an internal network (CNAME delegation is the way my ISP does it) and then did the exact same thing with my public block, I still got errors. Further frustration: when I did the same thing with a ‘public’ (but not static) IP in an internal view (i.e., a BIND view that only answers for hosts listed in the specified acl), it worked fine. Two things had happened. The first was the single-character typo I already mentioned. The second was that, combined with my really poor vision (even after multiple eye surgeries earlier in my life, and even with glasses) and the fact that I had been awake (not deliberately, just unable to sleep) since 1 a.m. or so each of those days, working on the problem from 4 to 6 a.m. and then again later in the day, I was oblivious to why I was struggling (I have set up PTR records plenty of times and never had this problem). I was going completely mad (beyond what I normally am, which is kind of scary…) until I noticed the typo by chance.

The problem, of course, is that BIND will happily name the zone that way, whereas in programming a single-character typo will (if it produces a syntax error) generate a compiler error, which, if you’re experienced enough (I am, so I would have preferred that), you can easily decipher however cryptic. This is more like a typo that, while a mistake, is also valid: somewhat like a spelling checker being unable to help when you have a valid word that is wrong in context, e.g., “He wants to the store.” instead of “He went to the store.”
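For illustration (the zone and file names here are examples, not the commenter’s real ones), the one-character difference looks like this in named.conf; BIND accepts the first form as a perfectly legal, useless zone name:

```
// Wrong: '-' instead of '.' before in-addr.arpa
zone "0.168.192-in-addr.arpa" { type master; file "db.192.168.0"; };

// Right:
zone "0.168.192.in-addr.arpa" { type master; file "db.192.168.0"; };
```

Running BIND’s bundled checker, named-checkconf -z, loads every configured zone and would likely have flagged the out-of-zone records before the first rndc reload.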

Cannot truthfully state I made any of the mistakes mentioned in the top 10, but I have certainly made plenty of my own (and I only keep learning), and I could easily see myself making one of those mistakes when tired, distracted or simply being what is expected of humans. One I just remembered arguably counts as a command-prompt mistake: symbolic links plus a non-recursive chmod can be dangerous, especially if the link happens to point at (for example) /dev/null, which is rather important to the system (as I recall, mknod won’t work without it, though I might be remembering wrong). It was actually a mistake in a cron job: the job wasn’t trying to change the permissions of that link (or of the file it refers to) but of files in that directory, and only that directory. Unfortunately, I had forgotten about the symlink to /dev/null that I kept there for something else. (The job itself was a stupid idea to begin with; the best reason I can reconstruct is that updates, e.g. of a CMS, were creating files with permissions I didn’t like, i.e. not strict enough.) I was able to fix the mistake without any trouble, though. Lesson learned: if you need a null-like device anywhere, just check the major/minor numbers of /dev/null and use mknod to create one with those properties at a different location (wherever the link was). More generally: be careful with symbolic links and what they refer to.
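The recipe at the end looks like this (a sketch; the device numbers shown are the usual Linux ones for /dev/null, and the mknod step needs root, so it is left commented, with an example path):

```shell
# Inspect /dev/null: a character device, major 1, minor 3 on Linux.
ls -l /dev/null

# Recreate an equivalent node elsewhere (root required; path is an example):
#   mknod /srv/app/mynull c 1 3
#   chmod 666 /srv/app/mynull
```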

As for the rsync suggestion, I disagree. It is quite useful on many occasions, if used properly. Here’s another way it is relevant: mirroring one site to another (in different parts of the world), be it over ssh or otherwise. It is very useful there and can offer big speed-ups. As for dump being the only safe option, I don’t agree that there is any 100% safe way. Certainly some utilities will do better than others, and some will do better with certain types of data than other utilities will, but there is no 100% foolproof solution. (Related to file types: compressing certain kinds of files is not very useful. Sure, there can be a small decrease in size when compressing, say, an mp3, but there is also the chance of an increase due to the extra data in, e.g., an archive. Example: if you compress an mp3 file and a text file of the same size, with the same compression utility and the same options, the text file should win.) I write ‘should’ because I suppose there are exceptions, as always, but a text file is not already compressed, whereas an mp3 file is (at least if it truly is an mp3 with normal bitrates and so on). Text is also much simpler to process, and understandably so: it represents less.

Was showing a colleague how to use the date command to output date strings for times other than now, e.g. date 01010101. I’ve done this many times before on Linux boxen, knowing that you need to use “date -s” to set the date. Only this was a Solaris 6 box, I was root, and “date 01010101” actually set the date!
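On GNU/Linux, the safe way to print an arbitrary date (no root, nothing set) is -d; setting the clock is a separate, explicit -s. The sketch below assumes GNU date:

```shell
# Print a formatted string for an arbitrary date; read-only:
date -d '2024-01-01' '+%Y%m%d'

# Setting the clock is deliberate and root-only (left commented):
#   date -s '2024-01-01 00:00'
# On old Solaris, a bare 'date 01010101' as root SETS the date.
```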

This was the active server in a bank’s primary Oracle database cluster, and needless to say it wasn’t happy at suddenly going back in time, and glorked itself!

Ah yes… I forgot about that difference (been years since I’ve used Solaris). But yes, the real error was being root simply to show someone that kind of thing (or anything that does not require root!). And yes, time is absolutely critical…

If you don’t actually format a partition afterwards, you can generally recover the entire partition, as long as you can boot into a suitable environment (e.g., a boot disk, or before you reboot). If you do format it, your chances are lower, although by no means is it impossible. (Windows filesystems, I seem to recall, are easier to recover; on the other hand, they manage files differently and fragment much more because of it.) As for how you recover partitions: it has been too long for me to explain it properly (I’m not going to work it out again just for this comment, even though I know I could), but it IS possible, so look into it. I might add something else: as I recall, this is possible whether or not the partition table has changed; you don’t need a backup of the partition table to recover a partition’s contents (there are always exceptions to the rule, of course). Keeping a backup is a good idea, however. Keeping backups, period (of the partition table and in general), is what too many people neglect!

I have definitely done a version of the chmod -R mistake. I was just learning about file permissions on my first Ubuntu development box and realized I didn’t need the execute bit on my website files. So I decided to chmod -R -x /[TAB]. I think it sat there for a while, removing the execute bit from every program on the system, before I Ctrl-C’d it; after that, I had no more working command-line tools. I had to re-install, but no real damage was done. I learned the power and responsibility that come with the command line that day.
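The targeted version of that cleanup, for what it’s worth (a scratch-directory sketch): let find pick regular files only, so directories keep the execute bit they need for traversal, and spell the path out rather than tab-completing it into /:

```shell
site=$(mktemp -d)
echo '<html></html>' > "$site/page.html"
chmod a+x "$site/page.html"      # simulate the unwanted execute bit

# Strip 'x' from regular files only, under an explicit path:
find "$site" -type f -exec chmod a-x {} +

ls -l "$site/page.html"          # no execute bits left
```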

Another time, I was on a production web server, about to clean up a content folder I didn’t need anymore. It was late at night (a bad, bad time to use the rm command). I was one directory higher than I thought I was, and wiped out the whole content folder for all the clients! My latest backup was from 3 months earlier. Luckily my partner had a good sense of humor about the whole thing and helped rebuild the missing 3 months of content within 24 hours from his original docs.

Simple boot-upside-the-head issue. Scenario: performing a software upgrade at a customer site. A UNIX system is running a script that takes about 13 hours to complete; we’re about 9 hours into the upgrade. I copy selected text from the output of the script (UNIX) and paste it into a document on the local PC (Windows). Simple procedure: click on the UNIX window, highlight text, click on the Windows document, Ctrl-V, repeat

See, the M$ confusion :) This is (also) why I always avoid using those stupid M$ keystrokes even when on an M$ machine (the person who devised them must have been a complete idiot), and instead use the good old Borland ones: Ctrl-Ins (copy), Shift-Ins (paste) and Shift-Del (cut), learnt back in the 1990s when starting with Pascal and C on DOS. Fortunately, most Windows programs still support them; I’ve only found a few Flash-based web apps that do not, shame on their creators. Yes, they are a bit less convenient: you cannot do them with your left hand alone without moving your right hand away from the arrow keys. But the error-safety and the ease of not getting confused (Ctrl-C needs no explanation, and most of you know what Ctrl-V is used for in vi(m), for example) are worth the small inconvenience. Moreover, this old Borland standard is even supported in many Linux applications. So, for all of you who don’t know these keystrokes and have to work with M$, try learning them; they might save your time and data ;)

That little dot ( . ) will get you in a world of trouble if accidentally omitted, or if a space sneaks in between. There is a huge difference between what rm -r ./ and rm -r / do. Remember: the [TAB] key is your friend :)
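A scratch demonstration of the relative form (the absolute forms stay in comments; GNU rm’s default --preserve-root will at least refuse a literal rm -r /):

```shell
cd "$(mktemp -d)"
mkdir sub; touch sub/file

rm -r ./sub      # relative path: removes only ./sub in the cwd
# rm -r /sub     # absolute: a lost dot retargets this at the root fs
# rm -r /        # GNU rm refuses this outright (--preserve-root)

ls               # empty again
```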

loving this. thread is tl;dr but have my own list. extensive. 20+ years Linux/UNIX here.

Yet, less than 6 months ago, instead of doing the usual USB thumb drive setup on my laptop (where it is always /dev/sdc), I put one into my large backup server, which has a 3TB RAID-5 external cabinet. Intending to drop an ISO onto the thumb drive, I ran ‘dcfldd if=a.iso of=/dev/sdc’ and trashed one of the RAID drives (the thumb drive was actually /dev/sdh). Worse luck, the array would not rebuild because of an unrelated issue, so I could not break the array and add sdc back to let it rebuild. I had to restore from offsite backup.
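One habit that would have caught this (a sketch; device names and the by-id path pattern vary by system): gather evidence about the target before writing, and prefer stable /dev/disk/by-id names over sdX letters, which get reassigned between machines and boots.

```shell
# Sizes, models and transports make an 8G 'usb' stick easy to tell
# apart from a 3TB RAID member before anything is written:
lsblk -o NAME,SIZE,MODEL,TRAN

# Stable identifiers; USB sticks show up with a 'usb-' prefix here:
ls -l /dev/disk/by-id/ 2>/dev/null || true

# Then write to the by-id path, not a guessed /dev/sdX
# (placeholder path; substitute your stick's actual id):
#   dcfldd if=a.iso of=/dev/disk/by-id/usb-<your-stick-id>
```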