How Not to Go Boom: Lessons for SREs from Oil Refineries

Wednesday, March 28, 2018 - 5:15 pm–5:35 pm

Emil Stolarsky, Shopify

Abstract:

Bad software doesn’t explode. You can describe it as exploding when it throws an exception, corrupts some data, or makes your computer unusable, but it doesn’t explode. When code doesn’t work, the solution is to figure out where the logic is incorrect and fix it. While SREs may be called engineers, we rarely face the consequences of engineers in other industries.

In contrast, when a chemical engineer makes a mistake designing a refinery, the consequences are very different. We’ve all seen videos of the repercussions online. Big, loud explosions reducing massive facilities to chunks of twisted metal. The reality is working with unstable chemicals is a lot harder than keeping track of pointers in C.

Yet despite the differences, industrial process plants can be surprisingly similar to a complex software system. Where refineries will use pressure relief valves, web services will degrade gracefully. Regardless if you’re protecting against thermal runaway in a plant or a cascading failure in a data center, the fundamental ideas can be shared by both domains.

In this talk, I’ll explore the techniques and ideas used to build and operate refineries and how we can use them to make our software systems more resilient and reliable.

Emil is a production engineer at Shopify where he works on scriptable load balancers, performance, and DNS tooling. When he's not trying to make Shopify's global performance heat map green, he's fighting his fear of heights in a nearby rock climbing gym.

SREcon18 Americas Open Access Videos Sponsored byIndeed

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.