Fleet cluster recovery

I've played around with Etcd for quite a while now, and it turned out to be essential to back up your data frequently. This guide describes how I usually recover a crashed Fleet cluster without following the recently documented Etcd backup procedure. I recommend following the official approach using etcdctl; however, the recovery procedure is complex, and operating on a JSON dump can give you more flexibility. Disclaimer: use this unofficial guide at your own risk.

## Assumption

There is (or was) an Etcd cluster running on more than one node which, for some reason, is no longer operating as expected. To bring the cluster back to life, you need at least one running node with valid data (to be recovered) or a valid Etcd backup. Furthermore, the procedure only makes sense if the Docker containers that Fleet was managing are still (at least partly) running.

## Creating an Etcd backup

Fleet stores all necessary data in the hidden directory /\_coreos.com. To list this directory with etcdctl, run the following command:

$ etcdctl ls /\_coreos.com --recursive
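
The same listing is handy as a quick sanity check after a restore. As a rough sketch (the function name `missing_keys` and the file names are made up for illustration and are not part of etcdctl or Fleet), you could save the listing before touching the cluster and later print every key that did not come back. Some entries may legitimately differ between the two runs, so treat the output as a hint rather than a verdict.

```shell
#!/bin/sh
# Hypothetical helper, not part of etcdctl or Fleet.
# Prints the keys that appear in the "before" listing but not in the "after" listing.
missing_keys() {
    before="$1"   # listing saved before the recovery
    after="$2"    # listing saved after the restore
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    # comm needs sorted input; use the C locale so sort and comm agree on the order.
    LC_ALL=C sort -u "$before" > "$a"
    LC_ALL=C sort -u "$after" > "$b"
    # -23 suppresses lines unique to the second file and lines common to both.
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example usage on a node:
#   etcdctl ls /_coreos.com --recursive > keys-before.txt
#   ... recover the cluster ...
#   etcdctl ls /_coreos.com --recursive > keys-after.txt
#   missing_keys keys-before.txt keys-after.txt
```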

To dump the data stored on the Etcd node, use the tool etcd-backup. The tool needs two configuration files (etcd-configuration.json and backup-configuration.json). Here is an example for the production environment:

Make sure you have the two config files (see above) in the same directory. The dump is written to a file named dump.json. Store it in a safe place.
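
Before relying on a dump, it is worth checking that it is at least non-empty and parses as JSON. The following is a minimal sketch, not part of the backup tool: the function name `archive_dump` and the archive directory are hypothetical, and it assumes `python3` is available on the machine for the JSON check.

```shell
#!/bin/sh
# Hypothetical helper: validates a dump file and keeps a timestamped copy of it.
archive_dump() {
    dump="$1"       # path to the dump file, e.g. dump.json
    dest_dir="$2"   # directory that collects the archived copies
    # Refuse to archive a missing or empty file.
    [ -s "$dump" ] || { echo "dump missing or empty: $dump" >&2; return 1; }
    # Check that the file parses as JSON (assumption: python3 is installed).
    python3 -m json.tool "$dump" >/dev/null 2>&1 \
        || { echo "not valid JSON: $dump" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    target="$dest_dir/dump-$(date -u +%Y%m%dT%H%M%SZ).json"
    # Print the path of the archived copy on success.
    cp "$dump" "$target" && echo "$target"
}

# Example usage:
#   archive_dump dump.json /var/backups/etcd
```

A dump that passes this check can still be incomplete, so it does not replace a test restore.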

## Recreate the Etcd cluster

Once the Fleet configuration has been saved as a backup, the existing Etcd cluster can be set up from scratch. It is also possible to repair the cluster by hand by adding and/or removing nodes; however, the current state of the implementation (0.5.0-alpha4) is not very stable.

To operate on several VMs at the same time, use csshx on OS X (it can be installed with brew). Connect to the hosts the Etcd cluster is running on and do the following:

$ systemctl stop fleet
$ systemctl stop etcd
$ rm -rf /data/etcd
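
If you would rather not delete the old data right away, a more cautious variant is to move the directory aside and remove it once the new cluster is healthy. This is only a sketch of that idea; the function name `set_aside` and the `.old-<timestamp>` suffix are my own choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: renames a data directory instead of deleting it.
set_aside() {
    dir="$1"   # data directory, e.g. /data/etcd
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # Strip a trailing slash, then append a UTC timestamp to the name.
    backup="${dir%/}.old-$(date -u +%Y%m%dT%H%M%SZ)"
    # Print the new location on success.
    mv "$dir" "$backup" && echo "$backup"
}

# Example usage on a node, after fleet and etcd have been stopped:
#   set_aside /data/etcd
```

The old directory keeps occupying disk space until it is removed by hand.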

Now fix the configuration on all nodes that are intended to join the party. Make sure ETCD\_INITIAL\_CLUSTER\_STATE=new is set in /etc/systemd/system/etcd.service. Start the Etcd server again on all nodes and check that it is operating as expected (use systemctl start etcd and etcdctl member list). Don't start Fleet for now!
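
When working on several nodes at once, it is easy to miss one. As a small sketch (the function name `check_initial_state` is hypothetical), the following checks that a unit file contains the setting on a line that is not commented out:

```shell
#!/bin/sh
# Hypothetical helper: verifies that ETCD_INITIAL_CLUSTER_STATE=new is set in a unit file.
check_initial_state() {
    unit="$1"   # e.g. /etc/systemd/system/etcd.service
    [ -r "$unit" ] || { echo "cannot read $unit" >&2; return 1; }
    # Drop comment lines first, then look for the setting anywhere on a line.
    if grep -v '^[[:space:]]*#' "$unit" | grep -q 'ETCD_INITIAL_CLUSTER_STATE=new'; then
        echo "ok: $unit"
    else
        echo "ETCD_INITIAL_CLUSTER_STATE=new not set in $unit" >&2
        return 1
    fi
}

# Example usage on a node:
#   check_initial_state /etc/systemd/system/etcd.service
```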
Now restore the backup to the cluster using the dump file from the backup procedure:

## How to improve

Establish automated backups of the Etcd cluster content. I started this by implementing a cron-like container that dumps Etcd on a regular basis and uploads the dump to AWS S3.

Move the Etcd cluster away from the Docker hosts. Set up a small dedicated pool of 3 machines that run the Etcd cluster and reconfigure Fleet to use it, instead of running Etcd on the same machines as Fleet. Preferably don't touch it ;).
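
For the automated backup mentioned in the first point, the upload step could look roughly like the sketch below. It is not the container I described, just an illustration: the function names, the bucket name, and the key prefix are placeholders, and it assumes the AWS CLI is installed and has credentials for the bucket. Creating dump.json itself is left to the backup tool.

```shell
#!/bin/sh
# Hypothetical helpers for uploading an existing dump to S3.

# Builds the object key from a prefix and a timestamp.
s3_key() {
    printf '%s/dump-%s.json' "$1" "$2"
}

# Uploads a dump file under a timestamped key (assumption: AWS CLI is configured).
upload_dump() {
    dump="$1"     # local dump file, e.g. dump.json
    bucket="$2"   # placeholder bucket name
    prefix="$3"   # placeholder key prefix, e.g. etcd
    [ -s "$dump" ] || { echo "dump missing or empty: $dump" >&2; return 1; }
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    aws s3 cp "$dump" "s3://$bucket/$(s3_key "$prefix" "$stamp")"
}

# Example usage, e.g. from a cron job that runs after the dump has been created:
#   upload_dump dump.json my-backup-bucket etcd
```

Timestamped keys keep older dumps around, so a broken dump does not overwrite the last good one.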