Presentation: Too Big to Fail: Lessons from Google and healthcare.gov

Location:

Salon D

Day of week:

Thursday

11:50am - 12:40pm

Failure is a fact of life, so we design our systems to be fault-tolerant at all levels. In practice, however, some components almost never fail. As the product grows, these components are increasingly stressed in new and different ways; when they ultimately do fail, they create outages for which we are unprepared. We thought we were designing for failure, but the design didn't include failures at this level. At Google, some of our most exciting production snafus involve large and unpredictable network-level failures; at healthcare.gov in late 2013, just about every component fell into this category on a daily basis.

Through stories of large-scale Google outages and smaller-scale healthcare.gov outages, we’ll illustrate situations in which we’re often flying blind and draw lessons from them about how to expose unknown weak points in our systems. We’ll discuss the importance of being able to model systems ahead of time and visualize solutions in real time (including during an outage). Attendees will learn a practical framework for anticipating potential large-scale outages and specific ways to increase systemic robustness, such as “practicing disaster”. Failure -- even large failure -- is a fact of life; outages don’t have to be.

Speaker: Nori Heikkinen

Google Site Reliability Engineering Expert

Nori has been an SRE at Google for the last 7+ years, working with her team to manage the invisible layers of load-balancing and proxying infrastructure shared by all internal products. She has both saved the day and created or exacerbated outages, sometimes both at once. Nori also took a leave of absence in early 2014 to work on the reliability of healthcare.gov, gaining a completely fresh perspective on what it really takes to sustainably keep a website running in the absence of existing team culture.