Exploring Costs of Coordination During Outages with Laura Maguire at QCon London

Laura Maguire spoke at QCon London [slides] about how coordination efforts during outages carry a high cognitive cost. Maguire found that coordination during anomaly response is difficult, that existing models can undermine speedy resolution, and that the strategies for controlling the cost of coordination adapt to the type of incident. Moreover, tooling adds its own costs of coordination, even when it is intended to reduce them.

Maguire was part of the SNAFUcatchers consortium, a group of organizations interested in resilience engineering. She had access to data that allowed her to explore the hidden costs of coordination during outages while studying the incident command system (ICS) model. ICS is a standardized approach to managing emergency response that establishes a hierarchy among the incident responders, the people working to resolve the incident. The basic flow is as simple as figuring out the problem, repairing it, and moving on. However, this model is not a good fit for systems that change continuously.

Incident response in software engineering is very different from incident response in other domains, mainly because software is complex and lives in a continually changing environment. In software, therefore, there is an implicit need to continually learn how systems work. Consequently, the types of failures organizations face are often quite challenging, with broad consequences, and they require multiple forms of expertise. Incidents need different people to handle events, but they also carry a high cost in terms of attention, said Maguire. In addition, the ICS has hidden costs that organizations usually see as a burden.

According to Maguire, the alternative to the ICS is:

The ability to seamlessly synchronize activities in a larger joint effort is quite meaningful. And we can see that if it typically runs smoothly, but each agent in this sort of distributed network can more fluidly adapt and adjust to the demands, we can lower the costs of coordination.

Maguire calls this ability "adaptive choreography," which is about being able to dynamically adjust how coordination happens. The role of the incident commander is still essential: there are times when decisions need to be made quickly, and the team needs to know who that centralized authority is, one that has the bigger picture in mind, said Maguire.

After evaluating many high-performance teams, Maguire found that:

In high-pressure events, very rapid, but straightforward interactions amongst the (incident) responders typically worked really well. And that's because they fulfilled the functional requirements of coordination, as they were carrying out their tasks. They're able to anticipate what needed to be done next, and they were able to take the initiative to do it. They listened in on what others were doing and they were able to better sequence the timing of those actions. So, they were able to provide input into critical decisions and point out potential threats and the implications of different courses of action.

Moreover, Maguire said that the tools a team uses to resolve incidents carry a substantial cost of coordination. For instance, a participant experiencing lag or delays on a web conference call adds cognitive demand. Time spent selecting the tool that best fits the moment is another cost, because adopting a particular tool can mean that the organization has to adapt to a new form of coordinating. All of these problems represent the hidden costs of coordination, and Maguire said that it is important to acknowledge the additional cognitive and coordination demands they create.

Finally, Maguire closed by saying:

I believe that software engineering could lead a step-change in incident management practices, and it's my hope that you'll continue to push the boundaries of what is possible in how we coordinate.