Recovering A Sentinel Configuration

May 23, 2015

Sentinel configuration and configuration management systems don’t play well
together, and neither do package management systems and the config file. As a
result it is possible to have your sentinel configuration file wiped clean
under a running sentinel. Here are some ways you might be able to recover your
running configuration.

Update: Article updated to reflect the release of a version supporting the
flushconfig command.

Requirements

First, let us consider the non-recoverable scenario. If every sentinel in your
constellation has had its config file wiped and has been restarted, and you are
not running Red Skull, there is no recovering it other than rebuilding it the way
you built it in the first place. However, if even one sentinel server is
still running with the configuration in memory, you can recover it. If you are
running Red Skull, you can recover even if all the sentinels have gone down,
provided at least one Red Skull server has not been restarted.

How do you know the configuration is still in memory? Connect to sentinel using
redis-cli -p 26379 -h <host> and issue an INFO command. If the Sentinel section
lists your pods (as master0, master1, and so on), congratulations: you can recover.
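If you have more than a couple of sentinels to check, reading that output is easy to script. Below is a minimal sketch, assuming you first save the output of redis-cli -p 26379 -h <host> INFO sentinel to a file. The script name check_pods.py and the function pods_in_memory are hypothetical.

```python
import sys


def pods_in_memory(info_text):
    """Return the pod (master) names listed in Sentinel's INFO output.

    Sentinel reports each monitored master on a line such as:
    master0:name=mypod,status=ok,address=10.0.0.5:6379,slaves=2,sentinels=3
    """
    pods = []
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Only the masterN lines describe monitored pods; skip headers and
        # other fields such as sentinel_masters.
        if not sep or not key.startswith("master") or not key[6:].isdigit():
            continue
        fields = dict(
            item.split("=", 1) for item in value.split(",") if "=" in item
        )
        if "name" in fields:
            pods.append(fields["name"])
    return pods


def main(argv):
    # Usage: redis-cli -p 26379 -h <host> INFO sentinel > info.txt
    #        python check_pods.py info.txt
    if len(argv) != 2:
        print("usage: check_pods.py <file containing INFO sentinel output>")
        return
    with open(argv[1]) as fh:
        found = pods_in_memory(fh.read())
    if found:
        print("Recoverable, pods still in memory:", ", ".join(found))
    else:
        print("No pods listed; this sentinel has nothing to recover from.")


if __name__ == "__main__":
    main(sys.argv)
```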

Scenario 0: Redis version >= 2.8.21 or 3.0.2

If you are running at least 2.8.21 or 3.0.2, you will have the easiest time.
Connect to each of your sentinels and execute “SENTINEL FLUSHCONFIG”. Done.
Now wasn’t that easy?
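With more than a handful of sentinels, a loop saves some typing. This sketch shells out to redis-cli rather than assuming any particular Python client library. It assumes redis-cli is on your PATH and can reach each sentinel; the function names are hypothetical.

```python
import subprocess


def flushconfig_argv(sentinel):
    """Build the redis-cli call that makes one "host:port" sentinel rewrite its config."""
    host, port = sentinel.rsplit(":", 1)
    return ["redis-cli", "-h", host, "-p", port, "SENTINEL", "FLUSHCONFIG"]


def flush_all(sentinels):
    """Send SENTINEL FLUSHCONFIG to every sentinel and collect the replies."""
    results = {}
    for sentinel in sentinels:
        proc = subprocess.run(
            flushconfig_argv(sentinel), capture_output=True, text=True
        )
        # Expect "OK"; stderr is kept so connection errors show up too.
        results[sentinel] = (proc.stdout + proc.stderr).strip()
    return results
```

For example, flush_all(["10.0.0.11:26379", "10.0.0.12:26379"]) (hypothetical addresses) returns each sentinel's reply, and every entry should read OK.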

Scenario 1: No Red Skull, sentinel process still running

The simplest option is to pick one of the pods in Sentinel and
“change” its configuration. For example, you can do a SENTINEL SET
<podname> parallel-syncs 1. Ideally you’d use the value it already
has. If you’ve not modified the defaults, setting parallel-syncs 1
doesn’t actually change the config, but it makes Sentinel think it
did.

Once sentinel thinks it has changed a setting it will trigger a full
write of the config to disk. This rebuilds your file. Do this on each
sentinel in your constellation and you’ve recovered.
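To reuse the value a pod already has, you can read its current parallel-syncs from SENTINEL MASTER <podname> and write that same value straight back. This is a sketch that assumes redis-cli is installed and that its output is piped rather than shown on a terminal, in which case it prints one reply element per line. The helper names are hypothetical.

```python
import subprocess


def redis_cli(sentinel, *args):
    """Run one command against a "host:port" sentinel and return output lines.

    When stdout is a pipe, redis-cli prints one reply element per line.
    check=True raises CalledProcessError if redis-cli exits non-zero.
    """
    host, port = sentinel.rsplit(":", 1)
    proc = subprocess.run(
        ["redis-cli", "-h", host, "-p", port, *args],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.splitlines()


def pairs_to_dict(lines):
    """Turn SENTINEL MASTER's flat field/value reply into a dict."""
    return dict(zip(lines[0::2], lines[1::2]))


def touch_command(pod, master_info):
    """Build a SENTINEL SET that re-applies the pod's current parallel-syncs.

    The value does not change, but Sentinel still treats it as a config
    change and rewrites the whole config file.
    """
    return ["SENTINEL", "SET", pod, "parallel-syncs", master_info["parallel-syncs"]]


def rewrite_config(sentinel, pod):
    """Touch one pod on one sentinel; run this once per sentinel."""
    info = pairs_to_dict(redis_cli(sentinel, "SENTINEL", "MASTER", pod))
    return redis_cli(sentinel, *touch_command(pod, info))
```

Touching a single pod is enough, since the rewrite covers the whole file, not just that pod's entry.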

Scenario 2: Red Skull running, Sentinels Restarted

For this one you’ll need to pull the constellation configuration from
Red Skull as your sentinel daemons have a clean slate. For this
scenario you will rely on the fact that Red Skull stores the
configuration of every single pod it knows about. There are two ways:
simple and complex.

Simple Recovery

With this option all of your sentinels will look alike. If you’re
running Red Skull to distribute the sentinel job across a bank of them,
this may not work cleanly for you - but it will get you to a state you
can recover from manually by rebalancing each pod.

For the simple route, pull the JSON data from the Red Skull API at
http://red.skull.host:8000/api/knownpods, then iterate over the pods, adding
each one back into Sentinel via the Sentinel API or by writing a new file and
starting sentinel back up. For the latter there is a short Python script:
Redskull-To-Sentinel-Config-File.
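If you want a starting point to adapt for the file-writing route, here is a sketch that assumes the endpoint returns a JSON list of pod records. The field names (name, master_ip, master_port, quorum, auth_pass) are assumptions for illustration, not Red Skull's documented schema. Check them against what your /api/knownpods actually returns; it may also wrap the list in an outer object. The output uses the standard sentinel monitor and sentinel auth-pass directives.

```python
import json
import urllib.request

# Placeholder host from the text; point it at your own Red Skull server.
KNOWNPODS_URL = "http://red.skull.host:8000/api/knownpods"


def fetch_known_pods(url=KNOWNPODS_URL):
    """Download Red Skull's list of known pods as parsed JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def render_sentinel_conf(pods, port=26379):
    """Build sentinel.conf text from pod records.

    ASSUMPTION: each record is a dict with the hypothetical keys
    name, master_ip, master_port, quorum and optionally auth_pass.
    """
    lines = [f"port {port}"]
    for pod in pods:
        name = pod["name"]
        # sentinel monitor must precede any other setting for the same pod.
        lines.append(
            f"sentinel monitor {name} {pod['master_ip']} "
            f"{pod['master_port']} {pod['quorum']}"
        )
        if pod.get("auth_pass"):
            lines.append(f"sentinel auth-pass {name} {pod['auth_pass']}")
    return "\n".join(lines) + "\n"


def write_sentinel_conf(path, pods):
    """Write the rebuilt config; start sentinel only after this succeeds."""
    with open(path, "w") as fh:
        fh.write(render_sentinel_conf(pods))
```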

However, if you are leveraging Red Skull’s ability to manage a cluster of
Sentinels for you, you’ll probably prefer option two: complex recovery.

Complex Recovery

For this option we are going to do a more in-depth data dump from Red Skull.
It is not yet guaranteed to work, because it relies on Red Skull code paths
that pull data from Sentinel.

For this option you will iterate over every known pod and write (append) to a
file for each known sentinel. Depending on how much time has elapsed between the
event and your recovery, there may be no other known sentinels. However, the data
is likely still there, so it should generally work.

To do this you will need to generate and store a mapping of sentinel ->
managed pod. For example, in a Python script you might keep one dictionary per
sentinel, named for the sentinel's ip_port, containing the pod name, current
master, and any settings for it such as auth-pass.

You would build these dictionaries up by iterating over the /api/knownpods pod
listing from every Red Skull server in the constellation. Once you have the set
of dictionaries, you can loop over each one and follow the basic procedure
outlined in the Simple Recovery section.
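The mapping described above could look something like the following sketch. It assumes each pod record carries a hypothetical sentinels field listing the "ip:port" of every sentinel known to manage it, alongside hypothetical name, master_ip, master_port, quorum and auth_pass fields. These are illustrative names, not Red Skull's documented output, so rename them to match what your servers return.

```python
from collections import defaultdict


def build_sentinel_map(pod_listings):
    """Map each known sentinel (keyed as ip_port) to the pods it managed.

    pod_listings holds one /api/knownpods result per Red Skull server.
    ASSUMPTION: the pod record field names used here are hypothetical.
    """
    mapping = defaultdict(dict)
    for listing in pod_listings:
        for pod in listing:
            settings = {
                "master": f"{pod['master_ip']} {pod['master_port']}",
                "quorum": pod["quorum"],
                "auth_pass": pod.get("auth_pass"),
            }
            for sentinel in pod.get("sentinels", []):
                key = sentinel.replace(":", "_")
                # A later listing for the same pod overwrites an earlier one.
                mapping[key][pod["name"]] = settings
    return mapping


def render_conf(pods, port=26379):
    """Render one sentinel's pods as sentinel.conf text."""
    lines = [f"port {port}"]
    for name, s in sorted(pods.items()):
        lines.append(f"sentinel monitor {name} {s['master']} {s['quorum']}")
        if s["auth_pass"]:
            lines.append(f"sentinel auth-pass {name} {s['auth_pass']}")
    return "\n".join(lines) + "\n"


def write_confs(mapping):
    """Write one <ip_port>.conf file per sentinel, to be copied into place."""
    for key, pods in mapping.items():
        port = key.rsplit("_", 1)[1]
        with open(f"{key}.conf", "w") as fh:
            fh.write(render_conf(pods, port))
```

Keying the outer dictionary by ip_port keeps each sentinel's pods separate, so a bank of sentinels with different assignments is rebuilt as it was rather than made to look alike.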

Caveats

There is a big caveat to recovery via Red Skull. If a current master goes down
before you can recover, that pod will need to be manually removed and re-added,
as Red Skull won’t know about the change because Sentinel didn’t either. You
could possibly recover by looking at the “old” master/slave host data and
manually checking and updating it.
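One way to do that manual check is to ask each of the old master and slave hosts for its current role. The sketch below runs the ROLE command through redis-cli. ROLE is available since Redis 2.8.12; on older servers, look at INFO replication instead. The candidate list of "host:port" strings is something you would pull from Red Skull's data yourself, and the function names are hypothetical. The code only accepts an unambiguous answer.

```python
import subprocess


def current_role(node):
    """Return the first line of ROLE for "host:port" (e.g. "master" or "slave").

    Returns None if redis-cli fails, times out, or prints nothing.
    """
    host, port = node.rsplit(":", 1)
    try:
        proc = subprocess.run(
            ["redis-cli", "-h", host, "-p", port, "ROLE"],
            capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = proc.stdout.splitlines()
    return lines[0] if proc.returncode == 0 and lines else None


def pick_master(roles):
    """Given {node: role}, return the sole master, or None if zero or several."""
    masters = [node for node, role in roles.items() if role == "master"]
    return masters[0] if len(masters) == 1 else None
```

For example, pick_master({n: current_role(n) for n in candidates}) names the node to use. Once you have it, you can remove the stale pod with SENTINEL REMOVE and add it back with SENTINEL MONITOR at the new address.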

As for choosing between the Sentinel API and writing a config file, I prefer
the API. Additions are more immediate, and you don’t need to bother with
stopping sentinel, writing files, and restarting. Indeed, you don’t even need to
run these operations on the Sentinel nodes directly: if you have connectivity,
you can do it from a bastion or your laptop/desktop.
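Re-adding a pod through the API takes a SENTINEL MONITOR, followed by a SENTINEL SET for any extra settings such as auth-pass. Here is a sketch meant to run from wherever you have connectivity. It assumes redis-cli is installed and that each pod record is a dict with the hypothetical keys name, master_ip, master_port, quorum and an optional auth_pass.

```python
import subprocess


def readd_commands(pod):
    """Build the Sentinel API calls that re-register one pod.

    ASSUMPTION: the pod dict keys are hypothetical; map them from your data.
    """
    name = pod["name"]
    cmds = [["SENTINEL", "MONITOR", name, pod["master_ip"],
             str(pod["master_port"]), str(pod["quorum"])]]
    if pod.get("auth_pass"):
        cmds.append(["SENTINEL", "SET", name, "auth-pass", pod["auth_pass"]])
    return cmds


def readd_pod(sentinel, pod):
    """Send the commands to one "host:port" sentinel; raise on the first failure."""
    host, port = sentinel.rsplit(":", 1)
    for cmd in readd_commands(pod):
        out = subprocess.run(
            ["redis-cli", "-h", host, "-p", port, *cmd],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if out != "OK":
            # Report only the command name and pod, never the password.
            raise RuntimeError(f"{sentinel}: {' '.join(cmd[:3])} failed: {out}")
```

Because Sentinel applies each command as it arrives, there is nothing to restart afterwards.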

Future Options

Despite our best efforts these scenarios can still plague us, whether from
automated code or from an “accident” by a human. Ideally Sentinel
would not store constellation state in its config, and there is work in
progress to do just that. However, it could still happen even with a state file
instead of the config file.

Because it could still happen, we now have the ability to tell Sentinel to flush
its config to disk, provided you’re running an up-to-date version. If not,
at least you now have some other options, especially if this happens
while trying to update to the newer versions.

Hopefully, should you ever find yourself in this unfortunate situation, these
methods will help you retrieve at least some of your sanity.
And, of course, your configuration.