Within the incident response team, the incident commander has the most critical role. They are responsible for running the team's designated incident response process. Arundel notes that:

"The key thing is to have one person in charge. You need a decision maker. Often, this will be the team lead, but over time, you should make sure to give everybody a turn in that chair."

The next role that Arundel recommends is the communicator. The communicator's job is to provide status updates both internally and externally. This includes updating management, project managers, and the impacted clients. Supporting the communicator is the records person, whose responsibility is to document everything as it happens, including taking notes, capturing screenshots, and collecting log data and metrics for future analysis. The final role that Arundel recommends is the researcher, whose responsibility is to hunt down answers to questions as they come up during the incident response process.

This matches closely with how Netflix runs incidents, as seen in the recent open sourcing of its incident management tool, Dispatch. Dispatch can automatically assign an incident commander based on the type, priority, or description of the incident. Dispatch can also facilitate communications by sending notifications on a set cadence, removing the need for a human to remember to send them out.
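As a rough illustration of rule-based commander assignment and cadence-driven status updates, the sketch below shows one way the idea could be modeled. It is not Dispatch's API or implementation: the `Incident` class, the routing table, the role names, and the cadence values are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    incident_type: str
    priority: str
    description: str


# Hypothetical routing table: (incident type, priority) -> commander.
COMMANDER_RULES = {
    ("security", "high"): "security-lead",
    ("security", "low"): "security-oncall",
    ("availability", "high"): "sre-lead",
}
DEFAULT_COMMANDER = "team-lead"

# Hypothetical update cadence per priority (assumed values).
UPDATE_CADENCE = {
    "high": timedelta(minutes=30),
    "low": timedelta(hours=4),
}
DEFAULT_CADENCE = timedelta(hours=1)


def assign_commander(incident: Incident) -> str:
    """Pick a commander from the routing table, falling back to a default."""
    key = (incident.incident_type, incident.priority)
    return COMMANDER_RULES.get(key, DEFAULT_COMMANDER)


def next_update_due(incident: Incident, last_update: datetime) -> datetime:
    """Return when the next status update should be sent for this incident."""
    cadence = UPDATE_CADENCE.get(incident.priority, DEFAULT_CADENCE)
    return last_update + cadence
```

Keeping the rules in plain data means the team can review and change who gets paged without touching the assignment logic, and the fallback ensures that every incident still has exactly one person in charge.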

As the team becomes better at resolving incidents and mitigating the issues that led to them, they may need other ways to ensure they are prepared. As Arundel states, "The more reliable your systems, the less frequently real incidents happen, so the more you need to practice them." This is where he recommends using red team versus blue team exercises. This concept, which originates in military exercises and is heavily used in information security, has one internal team, the red team, take on the role of "attacker". Its job is to create an incident that the other team, the blue team, needs to respond to. This is similar to the concept of game days, in which a failure is simulated within the environment to allow for testing systems, processes, and team responses.

Adrian Cockcroft, VP cloud architecture strategy at AWS, shares this sentiment and believes that a "learning organization, disaster recovery testing, game days, and chaos engineering tools are all important components of a continuously resilient system."

Arundel shares some tips for teams looking to host their first game day: "Keep it short and simple the first time round. Put together a basic plan for what you’re going to do: this is the first draft of your incident handling procedure." As the team becomes more practiced, he recommends starting to assign the various roles. For the first attempts at practice incidents, he advises keeping the exercise to around one hour in length. Finally, he feels that moving the debrief to the day after will provide a better experience as the team will have had time to reflect on their actions and learnings.

Eugene Wu, director of customer experience at Gremlin, shares many of the same tips. He also stresses the importance of clearly identifying up front the purpose of the game day and which scenarios are going to be tested. This makes it possible to identify the correct individuals to involve, both on the execution and the response sides. He also suggests scoping out the test cases to better define the perceived impact and extent of the potential blast radius. Finally, he recommends having a clear exit strategy in case the experiment needs to be aborted quickly.
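The planning advice above can be captured as a small checklist in code. The following is a minimal sketch, assuming a hypothetical `GameDayPlan` structure; the field names, the error-rate threshold, and the `should_abort` rule are illustrative assumptions and do not come from Arundel, Wu, or any Gremlin tooling.

```python
from dataclasses import dataclass

# The four roles described earlier in the article.
REQUIRED_ROLES = ("commander", "communicator", "records", "researcher")


@dataclass
class GameDayPlan:
    purpose: str              # why the game day is being run
    scenarios: list           # failures to be simulated
    roles: dict               # role name -> person assigned
    blast_radius: list        # services expected to be affected
    abort_error_rate: float   # hypothetical exit threshold (0.0 to 1.0)
    max_minutes: int = 60     # first exercises are kept to about an hour

    def missing_roles(self) -> list:
        """Return the required roles that nobody has been assigned to yet."""
        return [role for role in REQUIRED_ROLES if role not in self.roles]


def should_abort(plan: GameDayPlan, observed_error_rate: float,
                 affected_services: list) -> bool:
    """Exit strategy: stop if impact exceeds the threshold or the planned scope."""
    outside_scope = set(affected_services) - set(plan.blast_radius)
    return observed_error_rate >= plan.abort_error_rate or bool(outside_scope)
```

Writing the abort condition down before the exercise starts means nobody has to debate during the game day whether the experiment has gone too far.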