Human Error Competition

Doug's Oracle Blog

In the comments to the last post, Gregory suggested a blog thread to collect people's favourite (or not-so-favourite) stories of "Human Errors". As Gregory said ...

"I wish somebody could create a post to collect all the "Human Errors"
(I'm sure people are very creative when generating "Fun")."

Post your best story here or, because these stories may be long and detailed, post them on your own blog with a link here. I'm assuming we'll keep this to those errors involving computer systems otherwise it could get out of control.

Unlike Tom Kyte's new features challenge, I don't have a book that I've written to hand to give away as a prize, but as I'm about to attend Openworld I'm sure that I can pick up a suitable prize there, no expense paid. Failing that, I'll have a whip-round among The Boys (but that usually raises about 37 pence). I will be the sole judge of the best story and will be applying a cynical eye to detect any fabrication. (Although these stories do tend to beggar belief.)

P.S. Bear in mind that you can post anonymously and you can pretend that the story involves the error of a 'friend of yours'.

P.P.S. I won't be entering the competition, but I think my credentials in this area are already solid.

There was this programmer, who had to add this nice little feature that would allow users of a trusted 3rd party emailing system to do 'some' cleanup of their mail.
He coded it, he tested it, and he released it for overnight deployment. The next morning the phones at the helpdesk started ringing rather early. People were missing emails. Customers were missing emails. Actually, all customers were missing all their email bodies. The programmer was found to be on a plane, heading for a tropical scuba diving holiday. Well earned, after fixing this last issue so rapidly. Alas, the test he did had only email bodies of one person in the system. And yes, after the delete command they were gone. So, he concluded the test was successful. After deployment, and without a proper 'WHERE userid = :1' clause in his SQL statement, the cleanup job was rather efficient: when the first user tried this nice new feature, it cleaned out all email bodies of all users in the system.........
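A minimal sketch of the kind of test that would have caught this, assuming a hypothetical `email_bodies` table with a `userid` column (the real schema isn't given in the story) and using an in-memory SQLite database in place of the actual mail system. The point is simply that the test data holds more than one user, so a missing WHERE clause shows up as somebody else's mail disappearing.

```python
import sqlite3


def cleanup_mail(conn, userid):
    """Delete stored email bodies for ONE user; return the number of rows deleted.

    The table and column names are hypothetical, chosen for illustration.
    """
    cur = conn.execute("DELETE FROM email_bodies WHERE userid = ?", (userid,))
    conn.commit()
    return cur.rowcount


def make_test_db():
    """Build a throwaway database containing mail for two different users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE email_bodies (userid INTEGER, body TEXT)")
    conn.executemany(
        "INSERT INTO email_bodies VALUES (?, ?)",
        [(1, "hello"), (1, "meeting at 10"), (2, "invoice attached")],
    )
    conn.commit()
    return conn


conn = make_test_db()
deleted = cleanup_mail(conn, 1)
remaining = conn.execute("SELECT userid, body FROM email_bodies").fetchall()

# User 2's mail must survive user 1's cleanup.
print(deleted, remaining)
```

With only one user in the test data, the scoped and unscoped DELETE give the same result, which is exactly how the original test passed.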

We were gearing up to fill our first Oracle datawarehouse and of course we did not have any backup in place yet. No time to think about that. First priority was to get some data in, as always. After a lot of hassle and query tuning we got our first few gigabytes of data more or less in a couple of tablespaces.
We had been using a second database for testing purposes and one of the DBAs that had been working on other projects was about to join the DWH project. And what better way to get the feel of things than to play around with loading some things in the test environment? But, we said, please clean up the tablespaces before you start because they are automagically created during the load process. Coming from mostly 8i databases, the DBA found a new feature that came in handy there: "drop tablespace including contents and datafiles". What a nice chance to test how that works. Log on to the database, enter the command and execute it. Nothing simpler, works like a charm. Only to notice just after executing the command that the last letter of the database name was not a T for Test but a P for Production....

And so it was a very long night restoring a tablespace from an operating system backup that only just fitted twice on the disks. Nevertheless even that was a good exercise in working under "some" pressure. As some people reading this blog must surely remember.

Ooooh yes. I remember. Restoring a database, that was cpio-ed to tape, without any 'alter tablespace begin backup' precautions. And the database was rebuilding indexes, whilst the so called 'backup' was made. You forgot to mention that someone was hired a couple of months before, 'to design a backup plan'. Which he refused, because he only makes 'restore plans'. Backups are just there to support the latter. So the restore plan was available. But the DWH 'wasn't production yet'. It was only preloaded with huge amounts of data, gathered at quite some cost, including downtime of 24x7 production systems to get 'clean reads' of otherwise volatile data, and the dumps from the old system got deleted, because of disk space constraints. So a reload was not an option. I cannot repeat often enough: Except for your own VMware sandbox database, any database is a production database. Developers, consultants, whatever you call them, creating stuff in your database, they cost money. They cost, because they do something. And what they do, somehow is called 'production'. Maybe not the primary process of the company, but certainly useful. Otherwise you wouldn't pay them. And you don't want to lose what they created, otherwise they have to create it again, and worse, you have to pay them again. So, have a restore plan for any database.

@Doug: Eric only told 1/3 of the story. I'm afraid you already have a winner here..........

Back in the distant dark days of 1997 a friend of mine once accidentally depressed the 'Break' key on the VT320 console keyboard of a Sun server running Oracle7. We didn't have Chimp of the Month back then, but I'm sure that if we did then it would have been one of my older trophies.

What is the moral of the story? Don't drop radio pagers onto keyboards.

Well, I could talk about the time I spent 30 minutes trying to troubleshoot a problem laptop with a problem user (with a long history of trashing laptops). His "diagnosis" was "the computer burned up". After telling him to send it in to the main office so we could properly diagnose the problem...my manager informed me that there had been a fire at the user's worksite...the laptop literally burned up!

Or the time I joked with a colleague that the 5 1/4 slot on the front of her pc was actually a cd-rom drive...and then had to help her extract a cd from the 5 1/4 slot.

No...probably the best one was when I deleted the ORACLE_HOME and ORACLE_BASE directories from a production server...while the database was up and running.

Here is another real-life story and the hero was playing BOFH
I think. It was probably his last time, though...

We (DBAs) saved some very important data from the database. We copied the compressed file to a few locations but management was still nervous and wanted it definitely on tape, so they made me call our SA and ask him to put that big file on tape (as if he would be able to find it in a few weeks!). Anyway, one hour later, the SA called back to notify me that he had *successfully* copied (cpio'ed) the tape content on top of my super important file.

Sure enough, I told everyone that I had removed all the other copies, since our valorous SA could store it on tape in a much safer way...
Of course, I lied but I had my 5 minutes of enjoyment looking at their white faces.

I've changed my mind a bit. This is not an official entry, but I think it's only fair that I share one of my favourites that I don't think I've posted before.

I was working at one of the biggest merchant banks in the City of London in the mid 90s. One of the systems I worked on was the largest back-office trade settlement system at the time, in terms of number of trades processed per day. Our overnight batch was in constant danger of over-running.

One day a friend of mine committed one of the cardinal sins of logging into a Dev server 'through' the production server, i.e. when logged into Prod, they used rlogin to the Dev box to take care of some things. Someone asked them to shut down the dev instance. Unfortunately they hit Ctrl-D once too often, didn't have a sensible prompt set (this is layers of mistakes, here) and used sqlplus to connect to the database. They *didn't* check which instance they were connected to and they shut it down.

Within a few seconds (and I'm not exaggerating) the phones started ringing. My friend tells me that they knew what they had done when they heard the phone ring. Just knew.

OK, big mistake.

Next the head of application development and support was on the phone. Yes, it's true. Let's get the database restarted asap and restart the trade feed processes.

OK, job done. Very embarrassing (if it had been me) but everyone makes mistakes.

The next day I arrived at work and the problem was more or less forgotten. Better still, batch had been *much* quicker last night. Cool! Then someone realised why. When the database had been restarted, the trade feeds hadn't. The DBA swore he'd asked app support to do it. App support swore he (or she) hadn't.

Damn. That meant trades weren't settled on the correct day. The Bank of England takes a pretty dim view of that kind of thing. As do customers.

Then the final piece of the ceiling fell on top of us as we realised that if only a third of yesterday's trades were processed last night, then we had 166% of the usual trades to process in tonight's batch.

I understand the DBA stayed awake late that night to monitor the progress of batch jobs. They went through ok.

I was called in at a government site. They were using *very* old IBM hardware. When I asked what the hammer attached to the disk array was for, I got this answer: Each time we stop the disk array, the disks cool down and the heads stick to the platter. So, when we boot it up again, we listen to the disks spinning up and then we hit it with the hammer so the heads loosen up.

I got called in because a disk had crashed ... and there was no backup !

Oh - we got the data back ... the weird thing was that neither the disk nor the head was broken ... but the electronics interface. We replaced it with the PCB of another similar disk and it worked again. We needed an expert in getting raw data from the disk (AIX4/LVM/jfs) and then we used a data unloader to get the data out of the datafile.

One that comes to mind happened when I worked for a certain Facilities Management company looking after the servers for a certain large transportation infrastructure company.

We had this guy who had just started the previous week straight out of university. Apparently he was a total genius, he told us so. The /tmp filesystem on one server had filled up (this was HPUX 9, it tends to do that) so he had been given the task of clearing it down. He logged in as root (which had a home directory of '/', the default for HPUX) and typed 'cd/tmp' followed by 'rm -rf *'. The more sharp-eyed readers (or those I've told this to before) will note that there is a difference between 'cd /tmp' (what he intended to type) and 'cd/tmp' (what he actually typed). Had he looked at the screen he would have noticed an error message telling him that there was no command or script called 'cd/tmp' and he was therefore still in the root directory, logged in as root, on the payroll server, on the day the evening of which (by some coincidence) the monthly, 4 weekly and weekly payrolls were all to be run. Probably the worst place and time to type 'rm -rf *' and hit enter.

Fortunately there were two mitigating factors. Firstly the way HPUX works with wild cards is to first expand them and build a list of matching entries to process and then run the command. Secondly I happened to be looking at his screen at the time. I dived over and hit CTRL-C a number of times so no files were actually deleted. I don't know if the command would have completed or if it would have failed because the list of files would be too large and would blow some buffer. Frankly, I don't ever want to be in the position to test it. Especially not when logged in as root, on the payroll server, the day of the pay run.
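The underlying problem is that the delete relied on the current directory, which silently stayed at '/' when the 'cd' failed. A minimal sketch of the safer pattern, assuming a hypothetical helper called `clear_directory`: name the target explicitly, check that it really is a directory and is not the root, and only then delete. (In the shell, the rough equivalent is chaining the commands so the 'rm' only runs if the 'cd' succeeded.)

```python
import os
import shutil
import tempfile


def clear_directory(target):
    """Remove everything inside `target`, never relying on the current directory.

    Hypothetical helper for illustration. Unlike 'rm -rf *' it also removes
    dotfiles, because os.listdir() returns them.
    """
    target = os.path.realpath(target)
    if not os.path.isdir(target):
        # The 'cd/tmp' typo case: the target we meant does not exist.
        raise NotADirectoryError(f"not a directory: {target}")
    if target == os.path.abspath(os.sep):
        raise ValueError("refusing to clear the root directory")

    removed = []
    for name in os.listdir(target):
        full = os.path.join(target, name)
        if os.path.isdir(full) and not os.path.islink(full):
            shutil.rmtree(full)   # real subdirectory: remove recursively
        else:
            os.remove(full)       # plain file or symlink: remove the entry only
        removed.append(name)
    return sorted(removed)


# Demonstration on a throwaway directory rather than a payroll server.
with tempfile.TemporaryDirectory() as scratch:
    open(os.path.join(scratch, "old.log"), "w").close()
    os.mkdir(os.path.join(scratch, "cache"))
    removed = clear_directory(scratch)
    leftover = os.listdir(scratch)

print(removed, leftover)
```

A mistyped path now fails loudly before anything is deleted, instead of falling through to wherever the session happened to be.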

A co-worker, while working remotely on a Client's production server, started a job in detached and background mode as root. But then he realized that he had forgotten one minor step and had to kill it and asked me how to kill the job.
He walked away and within a couple of minutes the customer service guy came running to me stating that the Customer is getting absolutely no response from the server/application.. and then the co-worker enters my office stating the same thing. I asked him how he killed his background job.... found out that he forgot the % before killing job 1 with -9 option.....

Reminds me of the practise disaster recovery sessions we used to have once a year. We would take everything we had backed up and stored in the remote safe location and start restoring it on virtually maiden hardware. Nice to do and almost always successful. The nicest thing was the fight between the participants over who was to type the sequence:
su -
cd /
rm -fr *

My top mistake? I wanted to get the exact table definitions of a datamart to test, so I logged on to production with an ERD tool to reverse engineer the database. Somebody asked me a question and I forgot to log off. Hit the create including delete tables...the tool never asked where to connect. You get all hot and realise you have just thrown away the complete datamart...no back-up, and this datamart could only be loaded at the specific monthly closures. They were not happy.......;-

My disks were filling up, so I ran a job that cleared out all the log files. I ran it as root, and it cleaned up all the huge log files that happened to be lying around on the precious Unix server, hogging disk space unnecessarily. That huge alert log file got cleaned up and freed valuable disk space. And then the script also managed to clean up my redo*.log files.

On a related point, I've seen people run into problems with their alert log growing to insane sizes because they never archive it off. This has a number of issues: Oracle Support really don't like it when you upload a multi-gig alert log of which only the last 1k is relevant; Can take up a huge amount of disk space; Makes it #1 candidate for removal when the sysadmin is looking to clear some disk space at 3am because of a filesystem full message and "it's just a log file" so you risk losing useful diagnostic info in case of an issue arising.

Personally I like to archive it off to a datestamped file every week then compress after a month (all handled by a cron script) and dump to offline media a couple of times a year. This also provides a quick check, if the size of the archived files suddenly changes drastically (like normally they're X Mb and last week's is 4X Mb) it's a quick flag that something has happened. I have found it difficult to convince some of the DBAs I've had to work with to go along with this.
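A rough sketch of the weekly archive step and the size check described above. The function names, the file naming scheme and the factor of 3 are all assumptions made for illustration, and the compression and offline-media steps are left out. It also assumes that moving the live log aside is acceptable, i.e. that the database simply starts a fresh alert log on its next write.

```python
import os
import shutil
import statistics
import tempfile
from datetime import date


def archive_log(log_path, archive_dir, day):
    """Move the live log to a datestamped file and return the new path."""
    os.makedirs(archive_dir, exist_ok=True)
    dest = os.path.join(archive_dir, f"{os.path.basename(log_path)}.{day:%Y%m%d}")
    if os.path.exists(dest):
        # Never overwrite an existing archive: that is diagnostic history.
        raise FileExistsError(dest)
    shutil.move(log_path, dest)
    return dest


def size_changed_drastically(previous_sizes, new_size, factor=3.0):
    """Flag an archive whose size is far from the median of earlier ones."""
    if not previous_sizes:
        return False  # nothing to compare against yet
    typical = statistics.median(previous_sizes)
    if typical == 0:
        return new_size > 0
    return new_size > typical * factor or new_size < typical / factor


# Demonstration with a throwaway file standing in for the alert log.
with tempfile.TemporaryDirectory() as scratch:
    live = os.path.join(scratch, "alert_TEST.log")
    with open(live, "w") as fh:
        fh.write("ORA-00600 ...\n")
    archived = archive_log(live, os.path.join(scratch, "archive"), date(2024, 1, 7))
    archived_name = os.path.basename(archived)
    live_still_exists = os.path.exists(live)

print(archived_name, live_still_exists)
print(size_changed_drastically([10, 11, 9], 40))  # a week at 4X the usual size
```

Run from cron once a week, the size check gives the quick "something has happened" flag without anyone having to open the files.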

Well this one is a little long, but I tried putting in a link instead and it got kicked out as spam, so I'll just paste it in.

My very first job out of college was with a large oil company that had Oracle running on a bunch of VAX/VMS systems. We had a lot of code written in FORTRAN with OCI calls (this was back in the early 80’s, so no pre-compilers yet). I was working late one night from home, which in itself was an unusual thing because not too many people were able to work remotely at that time. We only had 1200 baud modems for crying out loud, so it was painful to do much of anything remotely. At any rate, I was working on a program with some kind of iterative processing which took a while to complete. So I’d make a few changes and run it, make a few changes and run it. Well I noticed that the execution time slowed down somewhat and so I went looking to see what else was running on the system. (brief digression: I had become a neophyte sys admin due to my being the Oracle DBA and needing to have some system privileges for doing upgrades and whatnot) So I had a look to see what might be slowing my program down and sure enough there was a batch job running that was really using a lot of cpu. Well I had learned about the ability of VMS to set process priorities and so I thought to myself, “that batch job has all night to run it shouldn’t be slowing me down right now”. So I determined to change the priorities so my program would not be competing so heavily with the batch job. Unfortunately, instead of lowering the priority of the batch process, I jacked the priority of my process way up. (you’ll see why I say “unfortunately” in a minute) So anyway, the priority change worked out great. My program executions began running even faster than they had prior to the batch job kicking off. So I went back to my programming routine. Make a few changes, execute the program, check the results, make a few changes, execute the program, check the results… until I executed the program and it didn’t come back. I remember thinking, “Uh oh, I think I messed up the check for getting out of that loop”.
So I thought, “well I’ll just ctl-C out and fix it”. Unfortunately, in a stupendous example of Murphy’s Law, it was at just this point that my modem lost its connection. Great! So I tried to reconnect. The modem was able to establish a connection, but the machine was so busy running the process with the insanely high priority that it didn’t have enough spare CPU to log another process in. (Unlike most systems today, VMS had a very hard priority system. The process with the highest priority basically stayed on the CPU as long as it wanted - oh and by the way, there was only the one CPU) So anyway, the program ended up running most of the night and only stopped because it filled up the disk with a log file that it was writing. Needless to say, the real sys admins were not too happy with me the next morning when I showed up at work.


Disclaimer

For the avoidance of any doubt, all views expressed here are my own and not those of past or current employers, clients, friends, Oracle Corporation, my Mum or, indeed, Flatcat. If you want to sue someone, I suggest you pick on Tigger, but I hope you have a good lawyer. Frankly, I doubt any of the former agree with my views or would want to be associated with them in any way.