Last Monday, one of Singapore's largest banks suffered a seven-hour IT outage that took down everything from back-office services to ATMs. This Tuesday, the flawed component was identified: an IBM employee.
"We take full responsibility for this incident," wrote DBS Group Holdings CEO Piyush Gupta in a statement. A laudably …

There, but for the grace of god

... goes pretty much every major company in the world.

The biggest failure in IT is that anyone with root has the power, or bad luck, to place the company they work for in exactly this situation. The only surprise is that this sort of thing doesn't happen more often - or maybe just that it isn't reported more often.

Until systems are built robust enough to survive the onslaught of a trainee with the manual held upside-down, we really can't call what we do a "profession".

Couldn't agree more...

and with every large company on the face of the Earth sucking out all the cash for executive bonuses in the multi-millions instead of spending it on training, we'll see a lot more of this. The last of the folks who know what they are doing, and who had sufficient training to work on complex systems, are starting to retire or already have.

Fun times ahead, wonder if companies will be able to sue retired executives for bad business practices after they've retired or moved on. You know, once it becomes apparent to everyone that they have ruined the companies that paid them.

RE: There, but for the grace of god

".....The biggest failure in IT is that anyone with root has the power, or bad luck, to place the company they work for in exactly this situation....." Yeah, so comforting to just blame the sysadmin, but the truth is this is a management failure, as just about every "laugh-at-the-silly-admin-that-pulled-the-wrong-disk" situation actually resolves down to. Why? Because it is management that selects the admin and gives them that root access. You wouldn't give a novice driver the keys to your Ferrari, would you? If you did, and they bent it, wouldn't you feel just a bit to blame for putting them in the driving seat?

This wouldn't have been some architect-level tech genius, this was probably the junior admin if they were doing the overnight shift. Read the article - the admin thought he was using a good procedure, the fact he didn't know it was a wrong procedure highlights several possible management failings:

1/ They hired an incompetent admin that didn't have the up-to-date training he claimed to have (i.e., he lied on his CV), which means their selection process was flawed (probably because they didn't include a skilled sysadmin in the selection team, who would have spotted the "exaggerations", just used HR drones).

2/ The bank introduced new kit but IBM didn't do the requisite staff training, either because they didn't check their staff's skillsets; or IBM decided to save a few pennies and just told the sysadmin to "self-train on the job"; or IBM actually didn't know what the new kit required, and hence couldn't provide a correctly skilled resource, probably because it was another vendor's kit.

3/ IBM management didn't assign a competent technical project manager or technical team leader who should have looked at the new kit when it was introduced, reviewed any new procedures, updated the sysadmin procedures and planned any additional training to get their skillset right.

So, blame the sysadmin if it makes you feel better, but it was incompetent management that put that incorrectly prepared sysadmin at the console.

@Combat, @Pete

I don't know about you but I wouldn't have a trainee working on my system at 3am. They'd be working on it when I'm there to supervise. And I'm senior, so I don't work at 3am :) Hell I don't even work at 3pm.

3am is the Fail

I've run 24 hour operations, and I used to insist that we subject all new procedures to "the 3am test". No-one is at their best at this time; 2am or 4am are much, much better bets. So we would make sure all procedures were simple enough to be followed at 3am, and any procedure whose timing we had a choice over would very definitely not be done at 3am.

I have no idea *why* 3am is such a problem, but observably it most certainly is. Something to do with biorhythms, I think.

RE: 3am is the Fail

So true! It always used to make me laugh that project managers would insist on scheduling work on 24 x 7 bizz crit systems for 2am, on the specious idea that the systems would be least busy then and it was therefore less risky! The problem is that your best staff and your vendor's best staff are also likely to be busy sleeping at that time. I used to be a real pain and call everyone involved in such early morning changes every fifteen-odd minutes just to make sure they were still awake, because you'd be surprised at the number of times I caught employees asleep at the console whilst they waited for someone else to complete some part of a change. The best one we had was when one sleepy admin rebooted the production billing server rather than a stand-by one, and it took ten long minutes before anyone realised! That's ten minutes of skilled people looking at screens and just not registering what was actually appearing on them, simply because they had hit that 3am low.

Employee

Fragile process

If an error in a "routine maintenance operation" causes this kind of outage, then I'd say it's a larger problem in the process rather than training. (Not that the place I work at is immune to these kinds of process failures, mind you, but at least we identify them for what they are.)

As an ex-IBMer

I've seen this happen a few times. The IBM method is get rid of anyone who has any skills (IE expensive) and replace them with trained monkeys that just blindly follow scripts. The monkeys are not able to tell if the processes in the scripts are valid or correct as they have never had the training to even have a basic understanding of what they are attempting to do.

I suggest to any company looking at outsourcing their IT support to IBM, DON'T DO IT! Ask Air New Zealand about what IBM IT Support did to them. I think IBM had them shut down for about a day.

What if it happened in your company?

Greetings from Singapore! I'd like to pose a question to Reg readers to gain some insight into management practices in different cultural contexts.

If this kind of massive failure happened in your country/company (as a service provider), what would happen next? I mean, to the unfortunate (or clueless?) sysadmin/engineer, Head of IT, COO, and/or even CEO. And where should the buck stop?

Over here in Asia, they will quietly reprimand the engineer and his/her supervisor, or even let them go. But the buck would usually stop there. Appreciate your comments. Thanks.

what happens? not much

The few occasions where I've seen cockups turning into big problems (dba dropping tables on a production database, coffee cup knocked over into the main router, someone changing root password and instantly forgetting the new one) the person involved has received admonitions from their peers/boss ("you PLONKER", etc.), but career-wise except for the coffee issue, they were regarded as "blips" in otherwise good work records.

The coffee-knocker left shortly afterwards of their own free will.

The conclusion was that these accidents could have happened to anyone and that everyone makes a mistake now and again. While this is true, and universally recognised, the underlying problem with our industry is that this is accepted and few, if any, companies feel the situation needs to be, or can be, improved. You do get point solutions to specific (costly) errors after the fact, but all the processes in the world (BS5750, ISO9000, ITIL) don't seem to account for finger trouble, and the IT systems themselves are designed to be so brittle that a simple error can kill them.

Medal

the IBM field circus

Well, this is what happens when you give an important support contract to a company whose CEO's only aim is to boost his own earnings by increasing the company share price, when sales are falling, by buying its own stock and sacking trained and experienced personnel.

I too have seen the IBM field circus fuck up and bend a pin on a chip after 'servicing' the mainframe and as a result the machine could not do simple maths.

Another place I worked in used to get a significant portion of their hardware from Amdahl, just to remind big blue that there were alternatives....

Robust infrastructure?

While I can't comment on the abilities of the tech that started the 'cascade failure', this does bring back memories of the email storage saga at Plusnet a while back.

They were replacing the storage arrays with shiny new ones and a tech (presumably a system architect or senior admin) had console windows open to both the old and the new array and issued a format command in the wrong console. Cue the ensuing shit storm, including the revelation that Plusnet don't (or didn't) back up the arrays, coupled with lots of (failed) attempts at data recovery and the vendor admitting that the array was 'too new' to work with their recovery tools.

Human error (from the admin perspective) is always going to be a factor in managing complex systems. When systems are robust enough to survive human error we won't need sys admins any more, just someone to turn up and set the stuff up.

That said there are always times when you have some kit that just doesn't want to play nice.

Why it's called human error

Doesn't have to be a complex system either. Just has to involve humans. Or, in the following, humans + beer.

I was once present in a recording studio in the 70s when a famous English band was doing an album. In those days the big tapes used would hold about 30 minutes of music, so you would need two for an album.

You can guess what's coming. They were reviewing tracks on the first tape, and then all went to the pub, leaving the Tape Operator to sort out a new tape so they could start recording more stuff. The tape was so expensive that it was normal to reuse it, wiping it first, of course.

They came back from lunch, worked for many hours (this in the days when an album took months to record because it took that long to get them all there at the same time), and decided they wanted to listen to something on the first tape again.

The tape op went to the cupboard to get the first tape.........

The look on his face was a picture. I didn't know skin could go that colour. They forgave him. There are references to the incident on the credits for the eventual album. More than one, actually.

DBS ought to get used to it....

procedural error ...

... there it is again, the "insight" that all errors are avoidable by following the correct procedures.

I'm sure IBM Global Services updates, extends, enhances, adapts its procedures all the time and, most importantly, creates new ones to cover whichever areas are found in root-cause-analysis (a standardized procedure, of course) to be lacking procedural coverage.

All so to make sure errors don't happen. Which they obviously can't, if the procedures are being followed.

Why is the belief so widespread that the human factor can be eliminated by creating ever more / ever more detailed procedures ?

Seems natural that the more procedures there are, the more likely it is for some poor tired soul working at 3:00am to follow one of them that happens to be inappropriate for the situation ...

My condolences to the poor employee who followed the wrong procedure.

My kudos to the manager who publicly admitted that making errors is human.

been there...

Or rather, I've been the poor sod who received the panic telephone call at 11:30pm from the technical director of what is now a major player in the UK mobile phone industry, but was then still only a few dozen employees in size. Seems some lower minion had been preparing to restore some test data from tape, and had done a rm -r * in the wrong folder, and wiped out the entire live accounts and sales systems... Luckily, enough important stuff had been backed up earlier that evening, so the damage was recoverable... The owners are now multi-millionaires. I sometimes wonder just how well they would have done had they actually lost it all...
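
For what it's worth, the "wrong folder" class of accident can be made harder to trigger by having the tool insist that the operator names the directory they believe they are in. What follows is only a rough Python sketch under that assumption: the function wipe_directory and its parameters are hypothetical, not anything from the incident above.

```python
import shutil
from pathlib import Path


def wipe_directory(target: str, expected: str) -> int:
    """Delete the contents of `target`, but only if it resolves to `expected`.

    Hypothetical helper: the operator has to state which directory they
    think they are wiping, and a mismatch aborts before anything is removed.
    Returns the number of top-level entries deleted.
    """
    target_path = Path(target).resolve()
    expected_path = Path(expected).resolve()
    if target_path != expected_path:
        raise ValueError(
            f"refusing to wipe {target_path}: expected {expected_path}"
        )
    removed = 0
    for entry in target_path.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real sub-directory: remove recursively
        else:
            entry.unlink()  # plain file or symlink: remove just the link
        removed += 1
    return removed
```

It won't stop someone who types the wrong path twice, but it does turn "I thought I was in the test folder" into an error message instead of a restore from tape.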

Not a process issue, but a process ADOPTION issue

It is the little things that trip us up. So this problem is NOT a process issue. It is a process ADOPTION issue. And why are processes not adopted (ie understood and used) by staff?

Simple. Most process documentation is aimed at IT building better systems, not at engaging end users so they understand what job they have to do, with links to the supporting forms, policies, work instructions, documents and systems. The process application (such as Nimbus Control) which can support this requirement is very different from traditional IT process modeling tools.

Ironically, IBM has bought loads of IT process modelling/management tools; System Architect, Holosofx, Lombardi - all of which are aimed at process automation - which are all now some part of Websphere.

So Nimbus works alongside IBM in clients - but unfortunately Nimbus Control is not deployed internally by IBM. It could have been. Cognos had a very effective deployment of Nimbus Control to document and drive their product implementation processes. But when Cognos was acquired by IBM, Nimbus was a casualty of the M&A "synergistic savings".

Suddenly, it doesn't sound like much of a saving when you look at the likely compensation claim by DBS Bank.

Sour grapes or sound thinking???

For a longer discussion on process adoption (with a less cynical tone) read my blog http://bit.ly/dsQiUI

The Swiss Cheese Model - the holes lined up and we got a 7 hour outage

Highlights an issue I am often accused of "whinging about"...

in that DR detection and recovery is not done in pleasant conditions or in a fresh state of mind. I use scripts and dashboards to highlight current status of components, and a load of scripted utilities to bring up services and test each response, reporting back when anything awry is found. Saved my bacon more than once.
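
A rough idea of what such a scripted status check might look like, as a Python sketch only: the names run_checks and failures and the example component names are made up for illustration, and the real probes (ping, port check, test query) are left as assumptions to plug in.

```python
from typing import Callable, Dict, List, Tuple

# A check is any zero-argument callable that returns True when the component is healthy.
Check = Callable[[], bool]


def run_checks(checks: Dict[str, Check]) -> List[Tuple[str, str]]:
    """Run every check and return (name, status) pairs in the order given.

    A check that returns False is reported as FAIL; one that raises is
    reported as ERROR with the exception text, so a single broken probe
    cannot stop the rest of the run.
    """
    results = []
    for name, check in checks.items():
        try:
            status = "OK" if check() else "FAIL"
        except Exception as exc:
            status = f"ERROR: {exc}"
        results.append((name, status))
    return results


def failures(results: List[Tuple[str, str]]) -> List[str]:
    """Return the names of all components whose status is not OK."""
    return [name for name, status in results if status != "OK"]
```

Usage would be something like run_checks({"database": check_db, "billing": check_billing}), where check_db and check_billing are whatever probes you already trust, then printing failures(...) so that a tired operator only has to read the short list of things that are actually wrong.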