Christmas, failure, and the exascale

Dan Reed was up late last night penning a new blog post that begins, appropriately for these post-Halloween weeks, with that most horrible chore of the year:

I speak, without hesitation or ambiguity, regarding the curse (sometimes literal) of serially wired holiday lights. My father and mother kept a healthy supply of spare bulbs for laborious and sequential replacement and testing whenever a single bulb failed, and a strand of thirty bulbs went dark. I still remember my excitement and delight when we first purchased strands with parallel circuit wiring.

…If there is an Aesop-like moral in this tale from my childhood, it relates to systemic design for resilience rather than component resilience alone. Parallel circuit resilience trumps serial circuit resilience, and the extra cost is repaid in greater systemic reliability. Alas, I fear we have not learned this lesson in parallel computer system design and parallel programming models and applications.

Dr. Reed goes on to draw an analogy between our highly failure-intolerant programming schemes and those serial Christmas lights. With enough lights (processes), you are guaranteed to spend much of your holiday season with dark spots on the tree. We need to inject failure tolerance into our application programs and system stacks, and Dan suggests a place to start looking for prior art: large-scale cloud computing.
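The math behind the analogy is worth seeing concretely. The following sketch (my own illustration, with made-up reliability numbers, not figures from Reed's post) shows why a serial, fail-anywhere-fails-everywhere design collapses as component counts grow toward exascale:

```python
# If each component survives independently with probability p, a serial
# system of n components -- where any single failure takes down the
# whole -- survives with probability p**n.

def serial_reliability(p: float, n: int) -> float:
    """Probability that all n serially dependent components work."""
    return p ** n

# A 30-bulb strand with very good bulbs is usually fine:
print(serial_reliability(0.999, 30))    # roughly 0.97

# But scale the same per-component reliability to 1,000 components
# and the system is dark most of the time it isn't lucky:
print(serial_reliability(0.999, 1000))  # roughly 0.37
```

This is exactly the lesson of the parallel-wired strand: resilience has to come from the system design tolerating individual failures, because per-component reliability alone cannot keep up with component count.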

To understand this potential shift in perspective, I heartily recommend Werner Vogels’ analysis of the power of eventual consistency for large-scale web services at Amazon. Eric Brewer’s thoughts on the CAP theorem, drawn from his Inktomi experiences, have also shaped theoretical and empirical assessments of large-scale system reliability. For those not familiar with the CAP theorem, it postulates that one can choose any two of Consistency, Availability or Partition tolerance. More generally, it offers a framework to reason about conflicting objectives.
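To make the eventual-consistency idea Vogels analyzes a bit more concrete, here is a toy sketch (my own illustration, not from the post or from Amazon's systems) using a grow-only counter CRDT: replicas accept writes independently, and a commutative, idempotent merge guarantees they converge once they exchange state:

```python
# A toy grow-only counter (G-Counter): each replica increments only its
# own slot; merging takes the element-wise max, so replicas converge to
# the same value regardless of how updates are ordered or delayed.

class GCounter:
    def __init__(self, replica_id: str, replicas: list[str]):
        self.id = replica_id
        self.counts = {r: 0 for r in replicas}

    def increment(self) -> None:
        self.counts[self.id] += 1

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # which is what makes convergence order-independent.
        for r, c in other.counts.items():
            self.counts[r] = max(self.counts[r], c)

    def value(self) -> int:
        return sum(self.counts.values())

# Two replicas take writes while partitioned, then reconcile:
a = GCounter("a", ["a", "b"])
b = GCounter("b", ["a", "b"])
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```

In CAP terms, this trades strong consistency for availability: both replicas kept serving writes during the partition, and agreement was restored afterward rather than enforced up front.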

He avoids claiming the cloud is a panacea, but he is right that there is a lot of intellectual property already developed there that our community should be studying carefully.
