Properties of good SLI metrics

In this section, we will talk about how to choose good metrics (SLIs). First of all, let’s think about our goal: we need to make sure our users are happy. A good SLI should have the following characteristics:

predictable relationship with user happiness

shows that the service is working as users expect it to

expressed as: good events / valid events

aggregated over a long time horizon

Why is the upper metric bad? First, its relationship with user happiness is unpredictable. Second, it is hard to measure accurately. With a bad metric, it is hard to set a threshold that separates normal operation from an outage, because the metric is likely to produce false positives or false negatives.
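The "good events / valid events" form listed above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `Event` class and its `valid` and `good` fields are hypothetical names introduced here, and what counts as valid or good depends on your service.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical fields, chosen for illustration only.
    valid: bool  # the event counts toward the SLI at all (e.g. not a health check)
    good: bool   # the event met the user's expectation (e.g. fast and successful)


def sli(events):
    """Return good events / valid events, or None if no valid event was seen."""
    valid = [e for e in events if e.valid]
    if not valid:
        return None
    good = sum(1 for e in valid if e.good)
    return good / len(valid)


events = [
    Event(valid=True, good=True),
    Event(valid=True, good=True),
    Event(valid=True, good=False),
    Event(valid=False, good=False),  # excluded from the denominator
]
print(sli(events))
```

Expressing every SLI as a ratio between 0 and 1 keeps different SLIs comparable and makes it straightforward to set a target on them later.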

Ways of measuring SLIs

There are five ways to measure an SLI. Each of them has pros and cons.

Request logs

Request logs allow you to retroactively backfill your SLI if you have not been measuring it directly. However, there is significant latency between the moment an event occurs and the moment it shows up in your SLI, which makes this approach unsuitable for emergency response.
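As a rough sketch of backfilling an SLI from request logs, the function below computes an availability ratio from log lines. The log format (`timestamp status latency_ms`, space-separated) is an assumption made for this example, as is the choice to exclude 4xx responses from the valid events; adapt both to your own logs and definitions.

```python
def availability_from_logs(lines):
    """Compute good / valid requests from log lines.

    Assumed (hypothetical) line format: "<timestamp> <status> <latency_ms>".
    """
    good = 0
    valid = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        try:
            status = int(parts[1])
        except ValueError:
            continue  # skip lines with a non-numeric status
        if 400 <= status < 500:
            continue  # assumption: client errors are not valid events
        valid += 1
        if status < 500:
            good += 1
    return good / valid if valid else None


sample_logs = [
    "2024-01-01T00:00:00Z 200 120",
    "2024-01-01T00:00:01Z 500 30",
    "2024-01-01T00:00:02Z 404 10",
    "2024-01-01T00:00:03Z 200 80",
    "garbage line",
]
print(availability_from_logs(sample_logs))
```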

Application metrics

Application metrics cover the scope of a single request. They do not measure multi-request sessions, because it is difficult to export that kind of information from stateless servers.

Front-end infra metrics

You can measure metrics at your front-end load balancer. It is the closest point to the user that is still under your control. You or your cloud provider will have detailed historical metrics, and this approach does not require much engineering effort to get started. The downside is that these servers are stateless and have no insight into the response data. They must rely on the metadata in the response envelope to judge whether a response is good.

Synthetic clients

Synthetic clients are good for checking that your service works as expected in the general case, but user behavior is often unpredictable, so you cannot fully rely on these metrics. Covering all scenarios could require a huge development investment.

Client-side instrumentation

Finally, you can go to the source of the user experience and measure it on the client side, which provides more accurate measurements. However, in this case many factors are outside your control, especially on mobile devices.

Data processing SLIs

We will cover some types of SLIs that could be useful in your system.

Freshness

When we batch-process data, it is a good idea to measure the time between when you receive data and when you show it to your customers. Freshness can be expressed as the proportion of valid data updated more recently than a threshold.
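A minimal sketch of the freshness ratio, assuming each valid record carries a last-updated timestamp. The function name, the one-hour threshold, and the sample data are illustrative choices, not values from any particular system.

```python
from datetime import datetime, timedelta, timezone


def freshness_sli(last_updated, now, max_age):
    """Proportion of records whose last update is no older than max_age."""
    if not last_updated:
        return None
    fresh = sum(1 for ts in last_updated if now - ts <= max_age)
    return fresh / len(last_updated)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Hypothetical records updated 5, 20, 90 and 240 minutes ago.
updates = [now - timedelta(minutes=m) for m in (5, 20, 90, 240)]
print(freshness_sli(updates, now, timedelta(hours=1)))
```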

Correctness

Correctness is expressed as the proportion of valid data producing correct output. To measure it, we choose a known valid input and compare the actual output with the expected output.
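One way to sketch this comparison is to run the processing step over inputs with known expected outputs and count the matches. The names `correctness_sli`, `double`, and `cases` are hypothetical, and `double` contains a deliberate bug so that the ratio is below 1.

```python
def correctness_sli(process, golden_cases):
    """Proportion of (input, expected_output) pairs the process gets right."""
    if not golden_cases:
        return None
    correct = sum(1 for given, expected in golden_cases if process(given) == expected)
    return correct / len(golden_cases)


def double(x):
    # Hypothetical pipeline step; deliberately wrong for x == 3.
    return 7 if x == 3 else x * 2


cases = [(1, 2), (2, 4), (3, 6), (4, 8)]
print(correctness_sli(double, cases))
```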

Coverage

Coverage is expressed as the proportion of valid data processed successfully. In this case, we need to determine how many of the valid records or requests we processed successfully.
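The coverage ratio could be sketched as follows, assuming that a processing failure surfaces as an exception and that validity can be checked with a predicate. Both assumptions, and all names below, are illustrative.

```python
def coverage_sli(records, process, is_valid):
    """Proportion of valid records that were processed without an error."""
    valid = [r for r in records if is_valid(r)]
    if not valid:
        return None
    succeeded = 0
    for record in valid:
        try:
            process(record)
            succeeded += 1
        except Exception:
            pass  # a failed record is valid but not covered
    return succeeded / len(valid)


# Hypothetical input: empty strings are invalid, "x" is valid but fails to parse.
records = ["1", "2", "x", "", "4"]
print(coverage_sli(records, int, lambda r: r != ""))
```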

Throughput

Throughput is expressed as the proportion of time where the data processing rate is faster than a threshold. If throughput falls, you might violate someone’s expectations.
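A small sketch of the throughput ratio, assuming the processing rate is sampled over equal-length time windows so that the share of windows approximates the share of time. The window size, the rate unit, and the threshold of 100 are assumptions for this example.

```python
def throughput_sli(rates_per_window, min_rate):
    """Proportion of equal-length windows whose rate is above min_rate."""
    if not rates_per_window:
        return None
    fast = sum(1 for rate in rates_per_window if rate > min_rate)
    return fast / len(rates_per_window)


# Hypothetical records-per-second averages for five one-minute windows.
rates = [120, 95, 130, 80, 110]
print(throughput_sli(rates, min_rate=100))
```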

Setting Reliable Targets

After choosing our SLIs, how do we choose targets for them? First of all, you already have metrics and you have users, so let’s assume that your users are happy now. It’s a great way to get started. You can check historical data and choose what you believe is a good target. If you don’t have historical data, start gathering it; after some period of time you will be able to choose the right targets. SLOs based on historical data are called achievable SLOs. Their disadvantage is that you have to be sure that your users were happy with their past experience.
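As one possible heuristic for deriving an achievable target from historical data, you could pick the highest value that the service met on most past days. This is only a sketch of one approach, not a standard rule: the function name, the 90% default, and the sample history are all assumptions made for illustration.

```python
def achievable_target(daily_sli, percent_met=90):
    """Highest target that at least percent_met% of past days would have met."""
    if not daily_sli:
        raise ValueError("no historical data yet; start gathering it first")
    if not 1 <= percent_met <= 100:
        raise ValueError("percent_met must be between 1 and 100")
    ordered = sorted(daily_sli, reverse=True)
    # Number of days that must meet the target (integer ceiling division).
    days_needed = (percent_met * len(ordered) + 99) // 100
    return ordered[days_needed - 1]


# Hypothetical daily SLI values for the last ten days.
history = [0.999, 0.998, 0.9995, 0.997, 0.9999, 0.990, 0.9992, 0.9985, 0.9991, 0.9993]
print(achievable_target(history))
```

With this sample history, nine of the ten days are at or above the returned value, so a single bad day does not drag the target down.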

Aspirational SLOs are SLOs based on business requirements. If you know that your users are unhappy, you need to set your aspirational SLOs higher than your achievable SLOs. If you don’t have historical data at all, try asking your product team which SLOs would make users happy.

It is more important to start gathering data and set reasonable targets than to set the right targets the first time, because you will need to adjust your targets regularly.