Learn how to manage mission-critical platforms with Google practices

You have probably wondered what secret lets companies like Netflix, Google, Amazon, AWS, and Airbnb deliver excellence and quality in the user experience of their services, haven’t you?

To answer this question, you need to understand what the most important feature of a product is. Imagine that you are a Netflix user and, hypothetically, someone gives you access to two different versions of the product, as shown below. Which one would you prefer?

Netflix of the year 2016?

Netflix 2019, but with the error below?

If I had to bet, I’d say you’d choose Netflix 2016, wouldn’t you? This little exercise shows that, for the customer, the most important feature of a product is that it works. Make no mistake, though: even though this is the main feature of a product, people commonly forget about it and only pay attention when things start to fail.

Before we get into the topic itself, it is important to make clear the difference between Availability and Reliability. Availability is about the time your service is up and ready to serve requests. Reliability refers to the quality of users’ interactions with your system.

Considering this, you can then conclude that one should seek 100% reliability and availability, right? Wrong!

For those who have studied the subject a little, the table below will look very familiar: it shows how long your service may be unavailable within the period analyzed.

Availability Level

If we aim for three nines of availability (99.9%), for example, we can be unavailable for about 43 minutes in a month. Each additional “9” also brings an exponentially higher cost in resources. This cost is financial ($$$) and also human.
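The arithmetic behind such a table can be sketched in a few lines of Python. This is only an illustration: the function name `allowed_downtime` and the choice of a 30-day month are assumptions made here, not something the original table specifies.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return how long a service may be down during `period`
    while still meeting the given availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The unavailable fraction is whatever is left after the availability target.
    return period * (1 - availability_pct / 100)


# Assumption: a "month" is taken as 30 days.
month = timedelta(days=30)
for pct in (99.0, 99.5, 99.9, 99.99, 99.999):
    minutes = allowed_downtime(pct, month).total_seconds() / 60
    print(f"{pct}% availability allows about {minutes:.1f} minutes of downtime per month")
```

For 99.9% over 30 days, this works out to roughly 43 minutes, in line with the figure quoted above.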

Even if you could achieve this hypothetical 100% availability, your users would not experience it in full. After all, how many times have you tried to use a service while your cell phone, computer, or internet connection was not working?

Chances are that if your goal really is 100% availability, you will never launch new features or improve your service, and your product will tend to stagnate.

You need to find the right balance between investing in features that win new customers or keep current ones and investing in the reliability and scalability that keep those same customers happy.

More experiments vs. more reliability

Now imagine that this challenge could be treated as a math problem. Imagine having a metric that aligns the interests of the operations teams, the development teams, and even the product team.

A Service Level Objective (SLO) defines a reliability goal that an application needs to achieve to meet the needs of its users. It is a measure of how well served your users are and of how well your product is working. Some examples of SLOs are shown below:

99.9% of total HTTP requests respond successfully

80% of total HTTP requests return in less than 700ms

95% of the runs of critical batch job ABC complete within 4 hours

Availability of 99.5% (3.6 hours of unavailability in a 30-day month)

Ideally, you should set goals that your customers really care about, that is, goals directly linked to a good user experience. For example, you should never use CPU consumption as an SLO metric; instead, consider the time the system takes for the user to complete a given action.
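To make the request-based examples above concrete, here is a minimal Python sketch of how measured values could be compared against such targets. The `Request` record, the function names, and the rule that any status below 500 counts as a success are all assumptions made for illustration; a real system would read these numbers from its monitoring data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to respond, in milliseconds


def success_ratio(requests: Sequence[Request]) -> float:
    """Fraction of requests that succeeded (assumption: status < 500)."""
    if not requests:
        raise ValueError("no requests to evaluate")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def fast_ratio(requests: Sequence[Request], threshold_ms: float = 700.0) -> float:
    """Fraction of requests answered in less than `threshold_ms`."""
    if not requests:
        raise ValueError("no requests to evaluate")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


def meets_slo(measured: float, target: float) -> bool:
    """An SLO is met when the measured ratio reaches the target ratio."""
    return measured >= target


# Hypothetical sample: 8 quick successes and 2 slow server errors.
sample = [Request(200, 120.0)] * 8 + [Request(503, 950.0)] * 2
print("success SLO (99.9%) met:", meets_slo(success_ratio(sample), 0.999))
print("latency SLO (80% < 700ms) met:", meets_slo(fast_ratio(sample), 0.80))
```

Both ratios are measured from the user's point of view (did the request work, and was it fast enough), which is why they are better SLO candidates than internal metrics.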

“SLO is about keeping your users happy”

You can implement SLOs for your application today, but that is just the foundation that allows you to respond to real emergencies.

Therefore, we need to go further to find balance in decision-making. There must be consequences when the target SLO is not reached. These consequences are not a punishment (for example, assigning blame or imposing financial penalties), but an opportunity to improve your service and enable your team to move as fast as possible without losing quality.

This is where the concept of Error Budget comes in, which is basically the difference between the perfect reliability and the agreed SLO over a defined period:

Error budget = (1 – SLO)

So if I have an SLO of 99.9%, the error budget is 0.1%. To give a more concrete example, a service that handles 1B requests/month has a quota of 1M errors/month.

Adopting the Error Budget concept shows that there is a small, but not zero, acceptable amount of failure. Your business decisions become data-driven, based on a metric that aligns the interests of all the tribes involved in the business and, above all, puts the customer (and their experience of using the product) at the center.
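The formula and the 1B-requests example can be sketched in Python as follows. The function names and the simple rule of releasing only while budget remains are illustrative assumptions, not a policy prescribed by the article.

```python
def error_budget(slo: float) -> float:
    """Error budget as a fraction: 1 - SLO (for example, 0.999 gives about 0.001)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return 1 - slo


def allowed_errors(slo: float, total_requests: int) -> int:
    """Number of failed requests the budget tolerates in the period."""
    return round(error_budget(slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    return 1 - failed_requests / allowed_errors(slo, total_requests)


def can_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical policy: keep shipping changes only while budget remains."""
    return budget_remaining(slo, total_requests, failed_requests) > 0


# The example from the text: an SLO of 99.9% and 1B requests in the month.
print(allowed_errors(0.999, 1_000_000_000))             # quota of failed requests
print(budget_remaining(0.999, 1_000_000_000, 250_000))  # share of the budget left
```

A team that has spent only part of its quota still has room for riskier releases, while a team that has overspent it would, under this assumed policy, prioritize reliability work until the budget recovers.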

By adopting the SLO / Error Budget approach, it is possible to resolve most of the conflicts between new releases of the application and the stability expected by the executives.

This working method was created by Google and serves as the basis of the Site Reliability Engineering (SRE) framework. It’s a real revolution in how you run and manage your operations.

If you want to understand a little more about SRE or even start practicing and implementing the SLO / Error Budget approach in your business, talk to us.