DevJam

Feb Jam Session w/ Casey Rosenthal Recapitulated

Casey Rosenthal, engineering manager of Netflix’s Chaos and Traffic Teams, joined us via Google Hangout for our February Jam Session. (Our first two-floor Jam Session!) Netflix’s core values of Freedom and Responsibility were on full display; the Chaos team’s very existence is a paragon of those values. At the heart of Casey’s talk were two ideas: Chaos Engineering and Intuitive Engineering.

First, Intuitive Engineering: Netflix’s system is so large that its engineers can only hope for a holistic understanding of their complex microservice architecture. “At any given time, we may be called upon to move the request traffic of many millions of customers from one side of the planet to the other.” That calls for something visual, capable of presenting information in an elegant, Tufte-like display. As the story goes, Casey grabbed a designer and an engineer and had them make something that could display real-time data graphically (and “it had to look cool”). “Instead of numerical information, we want a tool that surfaces relevant information to a human, for situations that would be too onerous to create a heuristic. These situations require an intuition that we can’t codify.”

In the video above “the circle in the center represents the Internet. The moving dots represent requests coming in to our service from the Internet. The three Regions are represented by the three peripheral circles. Requests are normally represented in the bluish-white color, but errors and fallbacks are indicated by other colors such as red.” At the beginning you see regular traffic. Then “you can see request errors building up in the region in the upper left [victim region] for the first twenty seconds or so. The cause of the errors could be anything, but the relevant effect is that we can quickly see that bad things are happening in the victim region.

“Around twenty seconds into the video, we decide to initiate a traffic failover. For the following 20 seconds, the requests going to the victim region are redirected to the upper right region [savior region] via an internal proxy layer. We take this step so that we can programmatically control how much traffic is redirected to the savior region while we scale it up. In this situation we don’t have enough extra capacity running hot to instantly fail over, so scaling up takes some time.”

They call this view into their system Flux.
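The gradual failover described in the quote can be illustrated with a toy model. This is only a sketch under stated assumptions, not Netflix’s proxy layer or scaling logic: the Region class, the failover_steps function, and every number below are hypothetical, and capacity is assumed to grow by a fixed step per scaling round. It shows why traffic is shifted in increments: only as much is redirected as the savior region has headroom for at each step.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical stand-in for a region; units are requests per second."""
    name: str
    capacity: int  # what the region can currently serve
    load: int      # what it is already serving for its own users


def failover_steps(victim_load, savior, scale_step, max_capacity):
    """Return a list of (savior_capacity, redirected_traffic) per scaling round.

    Assumption: each round adds `scale_step` capacity to the savior region,
    and the proxy redirects only as much victim traffic as there is headroom.
    """
    if scale_step <= 0:
        raise ValueError("scale_step must be positive")
    steps = []
    while True:
        headroom = max(savior.capacity - savior.load, 0)
        redirected = min(victim_load, headroom)
        steps.append((savior.capacity, redirected))
        # Stop once all victim traffic is moved or no more scaling is possible.
        if redirected == victim_load or savior.capacity >= max_capacity:
            return steps
        savior.capacity = min(savior.capacity + scale_step, max_capacity)


if __name__ == "__main__":
    savior = Region("savior", capacity=100, load=80)
    for capacity, redirected in failover_steps(70, savior, 20, 200):
        print(f"capacity={capacity} redirected={redirected}")
```

In this toy run the savior region starts with room for only 20 of the victim’s 70 requests per second, so the redirected share rises round by round until all of it has moved.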

So then what exactly is Chaos Engineering? “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production,” because turbulent conditions in production are a ‘when,’ not an ‘if.’ The Chaos Engineering team began with Chaos Monkey, which “is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group.” When that experiment turned out to be hugely successful, they expanded their scope: Enter Chaos Kong. Rather than shutting down individual instances, Chaos Kong simulates shutting down whole regions in production. Service interruptions are bound to happen, and Netflix can confidently say it is ready for them, as shown in the video above. This is really Nassim Taleb’s idea of antifragility, applied.
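The Chaos Monkey behavior quoted above can be sketched as a small in-memory simulation. This is not Netflix’s code and makes no AWS calls: the groups dictionary, the pick_victims and terminate functions, and the default of one instance per group are all assumptions made for illustration.

```python
import random


def pick_victims(groups, rng, per_group=1):
    """Choose instances to terminate in each group.

    `groups` maps a hypothetical group name to a list of instance ids.
    Assumption: `per_group` instances are picked at random from every
    non-empty group.
    """
    victims = {}
    for name, instances in groups.items():
        if not instances:
            continue
        count = min(per_group, len(instances))
        victims[name] = rng.sample(instances, count)
    return victims


def terminate(groups, victims):
    """Return a new mapping with the chosen instances removed."""
    return {
        name: [i for i in instances if i not in victims.get(name, [])]
        for name, instances in groups.items()
    }


if __name__ == "__main__":
    groups = {
        "api": ["i-a1", "i-a2", "i-a3"],
        "playback": ["i-p1", "i-p2"],
        "empty": [],
    }
    victims = pick_victims(groups, random.Random())
    print("terminating:", victims)
    print("survivors:", terminate(groups, victims))
```

The point of the experiment is what happens next: if a service cannot tolerate losing an instance from its group, the failure surfaces during a controlled run rather than during a real outage.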

As attendee Dan Wick put it, “Learning about how Netflix structures their engineering teams with each microservice was extremely insightful: 4 devs to 1 manager on each service. Their concept of Intuitive Engineering was also fascinating in how they use real time visualizations to easily see the health of their services.”