Incidents/Downtime

What does it mean to your client when you place a static switch or UPS in bypass? Most clients won’t understand that it means their business is now at the mercy of the utility supplying power. Most clients won’t associate risk with the statement at all — let alone imagine their company on the front page of the Wall Street Journal because of an outage. When you communicate with your client about maintenance or your process, is there any way to know what they really understand? Many years ago when we were running one of our first data centers, we tried to come up with a way to relate what we did in facilities to our clients. Shawn Patrick came up with an idea that we started calling “Level of Readiness.” We rated the risk to our customers based on our equipment and process conditions. Ever since, I have used a very similar idea to communicate to the clients of data centers the level of risk our operations pose to their processes, systems, and business....

It’s not uncommon for data centers to spend hundreds of thousands of dollars a year maintaining vital equipment. A typical maintenance budget can run 1 percent to 3 percent of your initial capital investment for each year of operation, and this amount goes up as the equipment ages. It’s a necessary evil if you expect high levels of reliability in your data center operations. But curiously, in this industry (and many others, by the way), the budget allocated to maintaining the most important asset, the people who operate that equipment, is usually a mere fraction by comparison. Are maintenance budgets misallocated? It’s well known that human error is one of the largest causes of downtime – if not the largest. It’s just an observation, but in my personal experience, I often see an allocation of budgeted resources that doesn’t align with what we say is important. Many budgets look like this:
Equipment maintenance budget – $600,000/year
Training for technicians budget – $10,000/year (if that)...

I can only speculate as to what caused Amazon’s latest outage, an apparent “loss of power.” But this week, I’m going to express my opinions in no uncertain terms – fair warning. In my experience, most organizations actually CHOOSE to have outages. I don’t care what their sales slogans promise. They choose to have outages. If you don’t believe me, just read their SLAs (Service Level Agreements). Most offer some sort of guarantee of uptime or service availability. Amazon guarantees 99.95 percent uptime – or about 0.72 minutes of downtime a day. It translates to more than four hours a year. Beyond that, most will give you “credit” toward the loss of service with either billing credit or more services. So as long as the outage is less than four hours per year, no foul. You might even get a “We’re sorry.” Rackspace offers a 100 percent uptime guarantee but will only reimburse 5 percent of your monthly fee for every half hour of outage. So if you have 10 hours of downtime, you don’t have to pay the monthly fee. Not a great option if your business is global and your average revenue is a million dollars/hour....
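The SLA arithmetic above is easy to check for any guarantee you are offered. The following is a minimal sketch, not any provider's actual billing logic: the function names are hypothetical, a year is assumed to be 365 days, and the credit is assumed to apply only to completed half hours and to stop at 100 percent of the monthly fee.

```python
MINUTES_PER_DAY = 24 * 60
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year


def allowed_downtime(uptime_percent):
    """Return (minutes per day, hours per year) of downtime that an
    uptime guarantee still permits."""
    downtime_fraction = 1 - uptime_percent / 100
    return downtime_fraction * MINUTES_PER_DAY, downtime_fraction * HOURS_PER_YEAR


def credit_percent(outage_hours, percent_per_half_hour=5.0):
    """Return the share of the monthly fee credited for an outage.

    Assumptions (not taken from any real SLA text): only completed
    half hours count, and the credit is capped at 100 percent.
    """
    completed_half_hours = int(outage_hours * 2)
    return min(100.0, completed_half_hours * percent_per_half_hour)


# A 99.95 percent guarantee, as quoted in the text
per_day, per_year = allowed_downtime(99.95)
print(f"{per_day:.2f} minutes/day, {per_year:.2f} hours/year")

# A 10-hour outage at 5 percent per half hour
print(f"{credit_percent(10):.0f} percent of the monthly fee credited")
```

Set the credited amount next to the revenue lost during those same hours and it becomes clear how little of the real cost the credit covers.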

Santiago Botero is a Colombian professional bicycle road racer. He’s best known for winning the mountains classification in the Tour de France, and the World Championship Time Trial. During the 2000 Tour de France, he kept a daily diary of his thoughts and progress for a newspaper back in Colombia. What follows was his entry for a part of the race that took place in the mountains: “There I am all alone with my bike. I know of only two riders ahead of me as I near the end of the second climb on what most riders consider the third worst mountain stage in the Tour. I say ‘most riders’ because I do not fear mountains. After all, our country is nothing but mountains. I train year-round in the mountains. I am the national champion from a country that is nothing but mountains. I trail only my teammate, Fernando Escartin, and a Swiss rider....

The lights flicker, your phone starts making “message received” sounds, and the radio crackles with excited voices. You recognize that something is not as it should be at the facility and you’re the person on duty with the responsibility to respond. It becomes apparent that the power system is in distress. The orders come over the radio to shift the “E” lineup to backup. You run to the “E” power room and quickly move the switch to the backup power supply position. You hear the breakers actuate, and then the unthinkable happens – the lights go out. The ironic thing is that shortly after you turned the switch, your mind was actually pondering the possibility that the order had been “D” and not “E.” And sure enough, the actual order, as it turns out, was to place D into backup and not E. Your actions caused a loss of power to the facility, compounding the initial problem....

I’m often asked, “What is the single, most effective thing we as leaders can do to eliminate human-caused downtime?” My answer is that the leader must be the example and never the exception. I say that because leaders occupy a very special place in the sociology and group dynamics of an organization. Consequently, the degree to which their behavior is viewed, scrutinized, and mimicked is amplified, sometimes exponentially. Sir Isaac Newton explained that, in physics, for every action, there is an equal and opposite reaction. When it comes to human behavior, the idea is much the same except the reaction may not be equal or opposite. We are, after all, a little more complex than inanimate objects. The concept in human behavior that corresponds to this law of physics is: If you want to change behavior in an organization, the leadership of the organization must change its own behavior first. Perhaps an experience I had in the Navy will illustrate what I mean. I was initially assigned to a submarine that was in overhaul, being basically rebuilt in a shipyard. We were getting to the end of the overhaul and it was necessary to clean the ship to put it back in fighting condition. Those of you who are familiar with construction sites know that dirt and debris are everywhere. The same is true for a submarine during overhaul. We had a big cleaning job to do. My assignment was to clean the bilges in an area under the diesel generator, one of the worst areas. I had to put on coveralls, climb under the diesel into the bilges, and clean out all the debris and dirt. It was a miserable job, to say the least. I had a helper who climbed under the diesel with me. We spent a good two hours working under the diesel and even found ourselves joking about the stuff we might find and whether it would be alive or …?...