Create a Culture of Strength: Resilience Engineering

Written by Stefan Thorpe

As the saying goes, “The best defense is a good offense.” The adage holds true for both football and software development. When it comes to organizations protecting themselves against disruptions, the tendency is to bulk up post-disaster rather than go on the offensive beforehand. In manufacturing terms, this can take the form of an inventory stockpile (which incurs obvious storage costs) or of spending more money on equipment, people, or floor space in reaction to machinery failure.

Reactions to IT catastrophes are very similar. When disasters happen, the business instinct is to throw money, time, and people at the problem until it’s fixed, and all three are costly responses. With governments and organizations becoming increasingly reliant on technology, the stakes are higher than ever. Software failures are no trivial matter; whole businesses, and even people’s lives, are at stake.

IT disasters trigger the same habitual reaction as manufacturing failures do: adding further buffers to an already complex system in an attempt to prevent the same disruptions from repeating.

Whether your team works in a manufacturing value stream or a technology one, though, the better approach is to go on the offensive and take proactive steps to reduce these disasters.

Post-Disaster

DevOps encourages teams to take a systemic view following IT disasters. Rather than looking for someone to blame, DevOps practices prompt an examination of all the factors, both human and technical, that contributed to the failure, and they do so in a blameless setting. Practices like the blameless post-mortem bring together everyone involved to find ways to mitigate recurring issues with reliability, resiliency, security, and cloud service recoverability, minus the “should have done this” mindset.

Once the review is complete, the next exercise should be to define small, incremental changes and tasks that act as countermeasures against future failures.

Resilience Through Destruction

Designing fault-tolerant architecture is not enough to prevent IT disasters. While cloud-based infrastructure is all about redundancy and fault-tolerance, there is no way to guarantee 100% uptime. Systems must be stronger than their weakest link. To achieve this state, teams must become better at problem-solving through self-diagnostics and self-improvement by learning from failures and mistakes alike. Only once there is strength in the working culture to accept and move past disasters without apportioning blame will technicians have the confidence to push further with actions to prevent disastrous events in the future.
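To see why redundancy helps yet never reaches 100% uptime, a rough back-of-the-envelope calculation can be sketched as follows. This is a minimal illustration, not a capacity-planning tool: it assumes component failures are independent (rarely true in practice), and the function names and availability figures are hypothetical.

```python
def serial_availability(components):
    """Availability of a chain in which every component must be up.

    Assumes independent failures; the result is never higher than
    the weakest component in the chain.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(availability, replicas):
    """Availability of N interchangeable replicas where one is enough.

    Assumes independent failures; approaches 1.0 as replicas are
    added but never mathematically reaches it.
    """
    return 1.0 - (1.0 - availability) ** replicas


if __name__ == "__main__":
    # Hypothetical figures: a load balancer, an app tier, and a database.
    chain = [0.999, 0.99, 0.995]
    print("single chain:      ", serial_availability(chain))

    # Same chain, but with three replicas of the weakest tier.
    hardened = [0.999, redundant_availability(0.99, 3), 0.995]
    print("with redundant app:", serial_availability(hardened))
```

Under these assumptions, replicating the weakest tier lifts the overall figure, but the remaining links still cap it below 100%, which is why recovery practice matters as much as architecture.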

Monkey Business

Netflix’s “Simian Army” is showcased so often as the model case study for resilience engineering that I won’t bore you with an overly in-depth look. For those who haven’t come across the renowned primate suite of resilience engineering tools before, here’s a quick catch-up:

Originally, the Simian Army was an internal suite of Netflix tools. These tools helped the company ride out the 2011 AWS US-East outage that dramatically affected other internet-based businesses at the time. The Army’s mission is to keep the cloud operating in top form, and it does so by randomly disabling production instances on AWS infrastructure. Chaos Monkey, the principal member of the Simian Army, is a resiliency tool that tests whether applications such as Netflix can tolerate this kind of fault injection. The process happens within a carefully monitored environment. Doing this exercise regularly allows IT teams to build up the necessary reactions and recovery mechanisms for unplanned disruptions.

Furthermore, the practice is suitable for more than just cloud-based companies; the same resilience engineering can be adapted for traditional corporate IT environments too. Because let’s face it, everyone experiences IT disasters.
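The idea of randomly disabling instances and practicing recovery can be sketched with a small simulation. This is not Netflix’s Chaos Monkey and makes no AWS calls: `InstancePool`, `unleash_monkey`, and `run_drill` are hypothetical names, the instances are just strings in memory, and `heal` stands in for whatever recovery mechanism (such as an auto-scaling group) a real system would rely on.

```python
import random
from dataclasses import dataclass, field


@dataclass
class InstancePool:
    """Toy stand-in for an auto-scaled group of production instances."""
    desired: int
    running: set = field(default_factory=set)
    _next_id: int = 0

    def heal(self) -> int:
        """Launch replacements until the pool is back at desired capacity."""
        launched = 0
        while len(self.running) < self.desired:
            self.running.add(f"i-{self._next_id:04d}")
            self._next_id += 1
            launched += 1
        return launched


def unleash_monkey(pool, rng, probability=0.5):
    """With the given probability, terminate one randomly chosen instance.

    Returns the terminated instance id, or None if nothing was killed.
    """
    if pool.running and rng.random() < probability:
        victim = rng.choice(sorted(pool.running))
        pool.running.remove(victim)
        return victim
    return None


def run_drill(rounds=20, desired=4, seed=7):
    """Alternate random terminations with recovery for a number of rounds."""
    rng = random.Random(seed)  # seeded so a drill can be replayed
    pool = InstancePool(desired=desired)
    pool.heal()  # initial launch
    terminated = []
    for _ in range(rounds):
        victim = unleash_monkey(pool, rng)
        if victim is not None:
            terminated.append(victim)
        pool.heal()  # the recovery mechanism being exercised
    return pool, terminated


if __name__ == "__main__":
    pool, terminated = run_drill()
    print(f"terminated {len(terminated)} instances; "
          f"{len(pool.running)} of {pool.desired} running after recovery")
```

In a real drill, the interesting part is what happens between the termination and the recovery: monitoring, alerting, and the team’s response are what the exercise is meant to strengthen.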
