Why computers fail

Good failure data for PCs is hard to find: who knows how many times PC users are told to reinstall Windows? But in a recent paper, Bianca Schroeder and Garth Gibson of CMU found some surprising results in 10 years of failure data from large-scale cluster systems at Los Alamos National Laboratory (LANL).

Among the surprises: new hardware isn't any more reliable than the old stuff. And even wicked smart LANL physicists can't figure out the cause for every failure.

Special problems of petascale computing
Despite the incredible performance of Roadrunner, LANL's new petaflop computer, the jobs it runs often take months to complete. With 3,000 nodes, failures are inevitable.

What to do?
LANL's strategy is to checkpoint: periodically stop the job and save its state. When a node fails they can roll the job back to the last checkpoint and restart, preserving the work done up to that checkpoint - but losing the work done after it.

Even with massively parallel high-performance storage, checkpoints take time away from getting the answer. Schroeder and Gibson's paper, "Understanding Failures in Petascale Computers," uses LANL's data to better manage the tradeoffs and to suggest new strategies.
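
To make the tradeoff concrete, here is a minimal sketch using Young's classic first-order approximation for the checkpoint interval. This is a textbook rule of thumb, not necessarily the model used in the paper, and the checkpoint cost and mean time between failures below are hypothetical numbers chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, system_mtbf_s: float) -> float:
    """Young's first-order estimate of the best time between checkpoints.

    checkpoint_cost_s: time to write one checkpoint, in seconds.
    system_mtbf_s: mean time between failures of the whole system, in seconds.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float,
                      system_mtbf_s: float) -> float:
    """Rough fraction of run time lost to checkpointing plus rework.

    First term: time spent writing checkpoints.
    Second term: expected rework, assuming a failure lands on average
    halfway through an interval.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * system_mtbf_s)


# Hypothetical values, not LANL measurements.
checkpoint_cost_s = 600.0        # assume 10 minutes to write a checkpoint
system_mtbf_s = 8 * 3600.0       # assume one failure every 8 hours

best = young_interval(checkpoint_cost_s, system_mtbf_s)
for label, interval in (("half", best / 2), ("best", best), ("double", best * 2)):
    lost = overhead_fraction(interval, checkpoint_cost_s, system_mtbf_s)
    print(f"{label:>6}: checkpoint every {interval / 60:.0f} min, "
          f"about {lost:.0%} of run time lost")
```

Checkpoint too often and the job spends its time writing state; too rarely and each failure throws away more work. Under this approximation, the more often the system fails, the shorter the best interval gets.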

But it's the failure data itself - and what it suggests about our own computers - that I found most interesting.

Failure etiology
Hardware accounts for over 50% of all LANL failures - with software at about 20%. Given all the PhDs at LANL you'd hope human error would be low on the list - and it is.

Is reliability improving?
Nope. LANL hasn't seen any improvement over the years - newer systems are no more reliable than hardware from a decade ago.

The key metric
The research showed that

. . . the failure rate of a system grows proportional to the number of processor chips in the system.

Which is a big problem for massive multi-processor systems.
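
A back-of-the-envelope sketch shows why. If the failure rate grows in proportion to the chip count, the mean time between failures (MTBF) of the whole system shrinks as 1/N. The per-chip MTBF and the chip counts below are made-up figures for illustration, not numbers from the LANL data, and the function name is hypothetical.

```python
def system_mtbf_hours(per_chip_mtbf_hours: float, num_chips: int) -> float:
    """System MTBF when failure rate scales linearly with chip count.

    Assumes every chip contributes the same constant failure rate, so the
    system failure rate is num_chips / per_chip_mtbf_hours.
    """
    if num_chips <= 0:
        raise ValueError("num_chips must be positive")
    return per_chip_mtbf_hours / num_chips


# Hypothetical per-chip figure, chosen only to make the scaling visible.
PER_CHIP_MTBF_HOURS = 100_000.0

for chips in (1, 4, 1_000, 10_000):
    mtbf = system_mtbf_hours(PER_CHIP_MTBF_HOURS, chips)
    print(f"{chips:>6} chips: one failure every {mtbf:,.0f} hours on average")
```

With these assumed numbers a single chip goes years between failures, while a machine with thousands of them fails every few hours or days, which is why long-running jobs can't avoid checkpointing.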

The Storage Bits take
Extrapolating these results to our desktop systems is straightforward - with one big caveat: most desktop system crashes are caused by software, not hardware.

Otherwise the Blue Screen of Death would be the No Screen of Death.

The biggest finding is that we shouldn't expect our system hardware to get more reliable. Improvements get balanced out by increased complexity.

Those of us with multi-processor systems can expect to see lower reliability - though with just a few systems you won't see any trends. It's a classic "glass half full" situation: our systems won't get better, but at least they won't get worse.

Robin Harris is Chief Analyst at TechnoQWAN LLC, a storage research and consulting firm he founded in 2005. Based in Sedona, Arizona, TechnoQWAN focuses on emerging technologies, products, companies and markets. Robin has over 35 years of experience in the IT industry and earned degrees from Yale and the University of Pennsylvania's Wharton...

Disclosure

Robin Harris is president of TechnoQWAN, a consulting and analyst firm in Sedona, Arizona. He also writes StorageMojo.com, a blog which accepts advertising from companies in the storage industry, and has a 30-year history with IT vendors. He has many industry contacts, many of whom are friends and all of whom he has opinions about.
Robin has relationships with many companies in the technology industry. Every company he writes about may have sought to influence his opinion through carefully crafted marketing messages and self-serving white papers, gifts ranging from desk calendars, T-shirts and lunches to trips, as well as analyst or consulting assignments. He also invests in some technology companies.
Robin discloses financial investments in or client relationships with companies named in Storage Bits. To help readers sort out the gold from the dross in his writings, Robin tries to communicate his reasons as clearly as he can. If you agree, you are intelligent and discerning. If you disagree, well, you disagree.
In all cases, Robin encourages readers to subject everything they read, see or hear on the internet or from politicians to some simple questions:
* What assumptions are implicit in the world view and judgments of the author?
* What, if any, is the factual basis for the opinions the author expresses?
* Is it reasonable, logical and clear?
Your critical faculties: use 'em or lose 'em!