Redis latency monitoring framework

Redis is often used in the context of demanding use cases, where it
serves a big amount of queries per second per instance, and at the same
time, there are very strict latency requirements both for the average
response time and for the worst case latency.

While Redis is an in memory system, it deals with the operating system
in different ways, for example, in the context of persisting to disk.
Moreover Redis implements a rich set of commands. Certain commands are
fast and run in constant or logarithmic time, other commands are slower
O(N) commands, that can cause latency spikes.

Finally Redis is single threaded: this is usually an advantage from the
point of view of the amount of work it can perform per core, and in the
latency figures it is able to provide, but at the same time it poses a
challenge from the point of view of latency, since the single thread
must be able to perform certain tasks incrementally, like for example
keys expiration, in a way that does not impact the other clients that
are served.

For all this reasons, Redis 2.8.13 introduced a new feature called
Latency Monitoring, that helps the user to check and troubleshoot
possible latency problems. Latency monitoring is composed of the
following conceptual parts:

Latency hooks that sample different latency sensitive code paths.

Time series recording of latency spikes split by different event.

Reporting engine to fetch raw data from the time series.

Analysis engine to provide human readable reports and hints according
to the measurements.

The remaining part of this document covers the latency monitoring
subsystem details, however for more information about the general topic
of Redis and latency, please read the Redis latency problems
troubleshooting page in this documentation.

Different monitored code paths have different names, and are called
events. For example command is an event measuring latency spikes
of possibly slow commands executions, while fast-command is the
event name for the monitoring of the O(1) and O(log N) commands. Other
events are less generic, and monitor a very specific operation performed
by Redis. For example the fork event only monitors the time taken by
Redis to execute the fork(2) system call.

A latency spike is an event that runs in more time than the configured
latency threshold. There is a separated time series associated with
every monitored event. This is how the time series work:

Every time a latency spike happens, it is logged in the appropriate
time series.

Every time series is composed of 160 elements.

Each element is a pair: an unix timestamp of the time the latency
spike was measured, and the number of milliseconds the event took to
executed.

Latency spikes for the same event happening in the same second are
merged (by taking the maximum latency), so even if continuous latency
spikes are measured for a given event, for example because the user
set a very low threshold, at least 180 seconds of history are
available.

What is high latency for an use case, is not high latency for another.
There are applications where all the queries must be served in less than
1 millisecond and applications where from time to time a small
percentage of clients experiencing a 2 seconds latency is acceptable.

So the first step to enable the latency monitor is to set a latency
threshold in milliseconds. Only events that will take more than the
specified threshold will be logged as latency spikes. The user should
set the threshold according to its needs. For example if for the
requirements of the application based on Redis the maximum acceptable
latency is 100 milliseconds, the threshold should be set to such a value
in order to log all the events blocking the server for a time equal or
greater to 100 milliseconds.

The latency monitor can easily be enabled at runtime in a production
server with the following command:

CONFIG SET latency-monitor-threshold 100

By default monitoring is disabled (threshold set to 0), even if the
actual cost of latency monitoring is near zero. However while the memory
requirements of latency monitoring are very small, there is no good
reason to raise the baseline memory usage of a Redis instance that is
working well.

The user interface to the latency monitoring subsystem is the
LATENCY command. Like many other Redis commands, LATENCY accept
subcommands that modify the behavior of the command. The next sections
document each subcommand.

The LATENCYHISTORY command is useful in order to fetch raw data
from the event time series, as timestamp-latency pairs. The command will
return up to 160 elements for a given event. An application may want to
fetch raw data in order to perform monitoring, display graphs, and so
forth.

The vertical labels under each graph column represent the amount of
seconds, minutes, hours or days ago the event happened. For example
“15s” means that the first graphed event happened 15 seconds ago.

The graph is normalized in the min-max scale so that the zero (the
underscore in the lower row) is the minumum, and a # in the higher row
is the maximum.

The graph subcommand is useful in order to get a quick idea about the
trend of a given latency event without using additional tooling, and
without the need to interpret raw data as provided by
LATENCYHISTORY.

The LATENCYDOCTOR command is the most powerful analysis tool in the
latency monitoring, and is able to provide additional statistical data
like the average period between latency spikes, the median deviation,
and an human readable analysis of the event. For certain events, like
fork, additional informations are provided, like the rate at which
the system forks processes.

This is the output you should post in the Redis mailing list if you are
looking for help about Latency related issues.

Example output:

127.0.0.1:6379> latency doctor
Dave, I have observed latency spikes in this Redis instance.
You don't mind talking about it, do you Dave?
1. command: 5 latency spikes (average 300ms, mean deviation 120ms,
period 73.40 sec). Worst all time event 500ms.
I have a few advices for you:
- Your current Slow Log configuration only logs events that are
slower than your configured latency monitor threshold. Please
use 'CONFIG SET slowlog-log-slower-than 1000'.
- Check your Slow Log to understand what are the commands you are
running which are too slow to execute. Please check
http://redis.io/commands/slowlog for more information.
- Deleting, expiring or evicting (because of maxmemory policy)
large objects is a blocking operation. If you have very large
objects that are often deleted, expired, or evicted, try to
fragment those objects into multiple smaller objects.

The doctor has erratic psychological behaviors, so we recommend
interacting with it carefully.