Incident Response with Emil Stolarsky

As a system becomes more complex, the chance of failure increases. At a large enough scale, failures are inevitable. Incident response is the practice of preparing for and effectively recovering from these failures.

An engineering team can use checklists and runbooks to minimize failures, put a plan in place for responding to them, and use postmortems to reflect on a failure and take full advantage of its lessons.

Emil Stolarsky is a production engineer at Shopify where his role shares many similarities with that of Google’s site reliability engineers. In this episode, Emil argues that the academic study of emergency management and industries such as aerospace and transportation have a lot to teach software engineers about responding to production problems.

In this interview with guest host Adam Bell, Emil makes the case that teams should stop relying on tribal knowledge and adopt practices such as an incident command system and rigorous use of checklists. He suggests leaving behind the “move fast and break things” mindset in favor of more deliberate preparation.

Show Notes

Sponsors

Our sponsor, Datadog, is a monitoring platform that helps teams identify, investigate, and resolve issues quickly, all in one place. Datadog integrates seamlessly with more than 200 technologies, including AWS, Docker, PagerDuty, and Slack. With powerful dashboards, sophisticated alerts, and distributed tracing and APM, Datadog provides deep visibility into your applications and infrastructure. But don’t take our word for it: start a free trial today and Datadog will send you a free T-shirt! Visit softwareengineeringdaily.com/datadog to get started.

Dice helps you accelerate your tech career. Whether you’re actively looking for a job or need insights to grow in your role, Dice has the resources you need. Dice’s mobile app is the fastest and easiest way to get ahead. Search thousands of tech jobs, from software engineering to UI/UX to product management. Discover your worth with Dice’s Salary Predictor based on your unique skill set. Uncover new opportunities with Dice’s new career pathing tool, which can give you insights about the best types of roles to transition to and the skills you’ll need to get there. Manage your tech career and download the Dice Careers app on Android or iOS today. To check out Dice and support Software Engineering Daily, go to Dice.com/sedaily. Thanks to Dice for being a sponsor of Software Engineering Daily.

Incapsula can protect your API servers and microservices from responding to unwanted requests. To try Incapsula for yourself, go to incapsula.com/2017podcasts and get a free enterprise trial of Incapsula. Incapsula’s API gives you control over the security and performance of your application, whether you have a complex microservices architecture or a WordPress site, like Software Engineering Daily. Incapsula has a global network of over 30 data centers that optimize routing and cache your content. The same network of data centers that filters your content for attackers also operates as a CDN, speeding up your application. To try Incapsula today, go to incapsula.com/2017podcasts and check it out. Thanks again, Incapsula.

Simplify continuous delivery with GoCD, the on-premise, open source, continuous delivery tool by ThoughtWorks. With GoCD, you can easily model complex deployment workflows using pipelines and visualize them end-to-end with the Value Stream Map. You get complete visibility into and control of your company’s deployments. At gocd.org/sedaily, find out how to bring continuous delivery to your teams. Say goodbye to deployment panic and hello to consistent, predictable deliveries. Visit gocd.org/sedaily to learn more about GoCD. Commercial support and enterprise add-ons, including disaster recovery, are available.