Chaos engineering technique

One of the biggest motivations for attending Software Crafters Barcelona this year was the open space sessions. In them, we discussed very interesting topics such as “feature branches vs. continuous integration”, “monoliths vs. microservices” and “how to manage diversity in our workplace”. We also shared the disastrous experiences we have had in production. This article focuses on that last point, because one of the strategies discussed there, and the one that surprised us the most, was “chaos engineering”.

This technique consists of deliberately causing failures in our production environment so that we can learn how the system reacts to them under controlled conditions. Examples include disconnecting one of the system's microservices, taking down a database, or simulating a network outage.
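To make the idea concrete, here is a minimal sketch in Python of what injecting a failure into a dependency could look like. Everything in it is an assumption made for illustration: `fetch_price` stands in for a call to a hypothetical pricing microservice, and `with_fault_injection` is a helper we made up, not part of any real chaos engineering tool.

```python
import random


class DependencyDown(Exception):
    """Raised when the injected fault simulates an unavailable dependency."""


def with_fault_injection(call, failure_rate, rng):
    """Wrap `call` so that it fails with probability `failure_rate`."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyDown("injected failure")
        return call(*args, **kwargs)
    return wrapped


def fetch_price(product_id):
    # Hypothetical stand-in for a request to a pricing microservice.
    return {"product_id": product_id, "price": 9.99}


def price_or_fallback(fetch, product_id):
    # The behaviour we want to observe: degrade gracefully instead of crashing.
    try:
        return fetch(product_id)
    except DependencyDown:
        return {"product_id": product_id, "price": None}


# Simulate the pricing service being completely disconnected.
always_down = with_fault_injection(fetch_price, failure_rate=1.0, rng=random.Random(0))
print(price_or_fallback(always_down, 42))
```

In a real system the failure would be injected at the infrastructure level (stopping a container, dropping network traffic), but the question we ask is the same: does the caller cope when the dependency disappears?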

Although the technique may seem dangerous at first sight, the premise is that its benefits outweigh the risks: causing a failure on purpose lets us observe which parts of the system behave correctly and which do not, so that we are prepared when the same failure occurs in an uncontrolled way.

Carrying out these experiments is not a trivial task, so a series of steps has been defined to make the execution safe and the results valid. That way we can take the appropriate action to fix the errors we find. These steps are described on the Principles of Chaos website.

Chaos engineering steps

Define the correct state of execution (steady state) of our system. We need to choose measurable parameters (for example, specific values of specific metrics) that indicate the system is working as expected, so that we can measure the difference when we cause the failure. We also need a control group and an experimental group: the first keeps running normally, and the second is where we run the experiment.

We define the hypothesis that this “correct” steady state will continue in both groups.

We perform the experiment. We introduce variables into our system that reflect the events we want to simulate, such as a server or a database going down.

We look for differences in the metrics, comparing the experimental group both with the values defined in the first step and with the control group. In this way we try to refute the hypothesis defined in step two.
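The four steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the steady-state metric is a request success rate, the injected event is a database outage, and the names `handle_request`, `success_rate` and `run_experiment`, as well as the 99% and 70% rates and the 2% tolerance, are made up for the example.

```python
import random


def handle_request(rng, database_up, fallback_rate):
    """Hypothetical request handler: returns True if the request succeeded."""
    if database_up:
        return rng.random() < 0.99  # assumed normal success rate
    # Database is down: only the assumed fallback (e.g. a cache) can answer.
    return rng.random() < fallback_rate


def success_rate(rng, database_up, fallback_rate, requests=10_000):
    """Steady-state metric (step 1): fraction of successful requests."""
    ok = sum(handle_request(rng, database_up, fallback_rate) for _ in range(requests))
    return ok / requests


def run_experiment(fallback_rate, tolerance=0.02, seed=0):
    rng = random.Random(seed)
    # Control group: normal execution.
    control = success_rate(rng, database_up=True, fallback_rate=fallback_rate)
    # Experimental group (step 3): same system with the database taken down.
    experimental = success_rate(rng, database_up=False, fallback_rate=fallback_rate)
    # Steps 2 and 4: the hypothesis is that both groups stay in the steady
    # state, so a large gap between the two metrics refutes it.
    hypothesis_holds = abs(control - experimental) <= tolerance
    return control, experimental, hypothesis_holds


control, experimental, holds = run_experiment(fallback_rate=0.70)
print(f"control={control:.3f} experimental={experimental:.3f} hypothesis holds: {holds}")
```

With a fallback that only serves about 70% of requests, the gap between the groups is far larger than the tolerance, so the hypothesis is refuted and we have found a weakness to fix. Calling `run_experiment(fallback_rate=0.99)` models a system that handles the outage well, and there the hypothesis survives.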

As programmers, we always try to make our code as stable and error-free as possible, anticipating every error that may occur. Chaos engineering goes a step further: it is about observing how our system behaves when failures do happen. If we have handled them correctly, we are safe; if not, we will see where the weaknesses are and be able to fix them.

