Thoughts on Reliability

I grew up in the Bell System, where reliability was second only to safety as an overriding concern. If a phone company employee caused a 15-second service outage, they’d certainly get counseled by their manager, and perhaps referred for additional training. If they caused a 15-minute outage, they’d get reviewed in front of their manager’s manager. The review wasn’t punitive. The point was to figure out what went wrong, how to capture the failure in an improved process, and how to get that information to others who might encounter a similar problem in the future.

Well, email is far more important to me now than telephone service was then. Last night at 5:00 pm, our IT staff started a server upgrade to patch against the “Conficker” virus. The upgrade crashed the machine, and service wasn’t restored for 15 HOURS. I got a note of apology at 8:00 am that translated basically as “Stuff happens.”

When did we decide that this was acceptable? Why do we put up with this?