Cassandra Backup and Restore

Cassandra is a powerful NoSQL database that scales easily, and a large distributed Cassandra cluster is highly fault-tolerant. Depending on the size of the cluster, Cassandra can survive the failure of one or more nodes without any interruption in service. It may therefore not be obvious why backups are needed at all. There is, of course, the very unlikely catastrophic failure that requires you to rebuild your entire cluster. More likely, though, data can become corrupt, and a corrupt write is replicated just like a good one. In either case it is useful to be able to roll the cluster back to a known good state.

Cassandra provides a useful command line tool called nodetool, which can create snapshots of the data. Nodetool has many other uses, but for this post we'll look specifically at the snapshot command. The documentation for snapshot can be found on the DataStax website. As a quick overview, nodetool snapshot flushes all data in memory to disk and then hard-links the on-disk data files into a snapshot directory alongside the existing data files. You can provide a tag for the snapshot using the -t flag; otherwise the snapshot tool tags it with a timestamp. Restoring a node is a mostly manual procedure: you delete all commit logs and data, then copy all of the data from a snapshot into the data directories. More detailed instructions can be found here.

Cassandra can be deployed in many ways, and it may or may not be bundled with the solution that uses it. Here we explore a case in which Cassandra is part of the solution: we provide Cassandra in a Docker container to make it easy to deploy and manage. This post explains a simple backup and restore solution so that end users can feel confident that their data is safe and under their control.

A previous blog post mentioned that we have created a Python command line tool called node-admin. Node-admin is our orchestration tool; paired with Ansible playbooks, it allows us to deploy a group of Docker containers on multiple nodes. One of the many containers it orchestrates is Cassandra. Node-admin, controlled by Ansible, is responsible for tasks such as clustering, backing up, and restoring Cassandra.

Backup

First, let's talk about backing up Cassandra with node-admin. Since Cassandra lives in a Docker container, we create a volume that maps Cassandra's data directory to a directory on the host filesystem. Not only is this better for performance, it also makes backup and restore easier. The basic process is as follows:

1. Create a snapshot using Cassandra's nodetool.

Node-admin uses docker-py to interface with Docker containers, but this step is equivalent to a docker exec that runs nodetool snapshot with the -t flag inside the Cassandra container.
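The exact command is not reproduced here, so the following is only a sketch of what such a docker exec might look like when driven from Python. It assumes the container is named cassandra and that nodetool lives at the same path as in the repair command shown later in this post; the function names are hypothetical and this is not node-admin's actual code, which goes through docker-py rather than the docker CLI.

```python
import subprocess

# Assumptions: container name and nodetool path are taken from the repair
# command later in this post; adjust both for your own deployment.
CONTAINER = "cassandra"
NODETOOL = "/opt/apache-cassandra-2.1.2/bin/nodetool"


def snapshot_command(tag, container=CONTAINER, nodetool=NODETOOL):
    """Build the docker exec argument list for a tagged snapshot."""
    return ["docker", "exec", container, nodetool, "snapshot", "-t", tag]


def take_snapshot(tag):
    """Run the snapshot inside the container; raises if the command fails."""
    subprocess.run(snapshot_command(tag), check=True)
```

Calling take_snapshot("some-tag") would run the equivalent of docker exec cassandra /opt/apache-cassandra-2.1.2/bin/nodetool snapshot -t some-tag.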

In simplified form, a snapshot ends up in a snapshots/some-tag directory under each table directory of each keyspace, next to that table's live data files.

2. Traverse the data directory looking for all the snapshot directories of that tag name.

The next thing node-admin does is traverse every table in every keyspace to collect all of the snapshots with that tag. It moves all of these files into a temporary directory. In the process, it strips the snapshots/some-tag path components so that the files sit directly under their table directories, which makes restoring the data easier.
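A minimal sketch of this traversal, assuming the usual keyspace/table/snapshots/tag layout under the data directory. The function name collect_snapshot and its parameters are hypothetical and only illustrate the idea; node-admin's real implementation may differ.

```python
import os
import shutil


def collect_snapshot(data_dir, tag, staging_dir):
    """Move every snapshots/<tag> directory under data_dir into staging_dir,
    dropping the snapshots/<tag> path components along the way.

    Returns the table paths (relative to data_dir) that were collected.
    """
    collected = []
    for root, dirs, _files in os.walk(data_dir):
        if os.path.basename(root) != "snapshots" or tag not in dirs:
            continue
        snap_dir = os.path.join(root, tag)
        # root is <data_dir>/<keyspace>/<table>/snapshots, so its parent
        # is the table directory the snapshot belongs to.
        table_rel = os.path.relpath(os.path.dirname(root), data_dir)
        target = os.path.join(staging_dir, table_rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(snap_dir, target)
        dirs.remove(tag)  # the directory is gone, so do not descend into it
        collected.append(table_rel)
    return sorted(collected)
```

Because the snapshot directories are moved rather than copied, the live data directory no longer contains them afterwards, and the staging directory mirrors the keyspace/table layout of a real data directory.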

3. Create a tgz containing the data as it would appear in Cassandra.

The data in the temporary directory now looks like the data directory of a Cassandra node, so at this point we create a tgz and store it in a location supplied by the user. This archive can then be copied off of the node using Ansible so that the backup is not stored in the same location as Cassandra, which mitigates the damage of a catastrophic failure.
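Creating the archive can be sketched with Python's standard tarfile module. The function name and arguments below are hypothetical; the point is only that entries are stored relative to the staging directory, so the archive mirrors the layout of a Cassandra data directory.

```python
import os
import tarfile


def create_backup_archive(staging_dir, archive_path):
    """Write the contents of staging_dir to a gzip-compressed tarball.

    Entries are stored relative to staging_dir, so extracting the archive
    into a data directory recreates <keyspace>/<table>/<files>.
    """
    os.makedirs(os.path.dirname(os.path.abspath(archive_path)), exist_ok=True)
    with tarfile.open(archive_path, "w:gz") as archive:
        for entry in sorted(os.listdir(staging_dir)):
            # Directories are added recursively by default.
            archive.add(os.path.join(staging_dir, entry), arcname=entry)
    return archive_path
```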

These steps are fairly easy to do with Python when the Cassandra directory is mapped to the host. The finished product is a single node-admin command.

The user must provide a tag for the backup. If no backup_dir is specified, the tgz is stored in a default location. We would also like to make the --tag flag optional, in which case the tag would most likely default to a timestamp.
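To illustrate the optional-tag idea, here is a hedged sketch of how such arguments could be parsed with argparse. It is not node-admin's real command line: the default directory is an assumption borrowed from the restore example later in this post, and the timestamp format and function names are hypothetical.

```python
import argparse
from datetime import datetime, timezone

# Assumption: the real default location is not stated in this post.
DEFAULT_BACKUP_DIR = "/opt/cassandra/backup"


def default_tag(now=None):
    """Return a UTC timestamp tag such as 20150102T030405Z."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%dT%H%M%SZ")


def parse_backup_args(argv):
    """Parse hypothetical backup options; --tag falls back to a timestamp."""
    parser = argparse.ArgumentParser(prog="node-admin backup")
    parser.add_argument("--tag", default=None,
                        help="snapshot tag (defaults to a UTC timestamp)")
    parser.add_argument("--backup_dir", default=DEFAULT_BACKUP_DIR,
                        help="directory in which to store the tgz")
    args = parser.parse_args(argv)
    if args.tag is None:
        args.tag = default_tag()
    return args
```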

Restore

The process of restoration is a little bit more involved but still fairly simple. The process is as follows:

1. Stop the Cassandra Docker container.

We use the "Node restart method" outlined in the DataStax documentation referenced at the beginning of this post. The first step of that method is to shut down the node. Since we're using Docker containers, the easiest way to do this is to stop the container. The data directory is preserved on the host filesystem because we used Docker volumes.

2. Clear commit log and data directories.

The next step is to clear out the commit log directory. As stated in the DataStax documentation, this prevents Cassandra from replaying old commit logs and overwriting the data we are about to restore. We also delete the contents of the data directory. Since we removed all of the snapshots during the backup process, we don't need to worry about accidentally deleting snapshot directories here.
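Clearing a directory's contents while leaving the directory itself in place (it is the mount point of the Docker volume) could look like the sketch below. The helper name clear_directory is hypothetical, and you would call it once for the commit log directory and once for the data directory, with paths that match your own volume mapping.

```python
import os
import shutil


def clear_directory(path):
    """Delete everything inside path but keep path itself.

    Returns the number of top-level entries that were removed.
    """
    with os.scandir(path) as it:
        entries = list(it)  # materialize before deleting anything
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)
    return len(entries)
```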

3. Extract the tarball into the Cassandra data directory.

To actually restore the old db files, we extract our tgz into the data directory. The tgz preserves the directories and db files, so they end up in the proper place on the filesystem.
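Extraction can be sketched with tarfile as well. The function name below is hypothetical, and the path check is an extra precaution added for this example (it rejects archive members that would land outside the data directory); the post does not say whether node-admin does anything similar.

```python
import os
import tarfile


def restore_archive(archive_path, data_dir):
    """Extract a backup tgz into data_dir, refusing unsafe member paths."""
    data_root = os.path.realpath(data_dir)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            dest = os.path.realpath(os.path.join(data_root, member.name))
            # Every member must resolve to a path inside the data directory.
            if os.path.commonpath([data_root, dest]) != data_root:
                raise ValueError("unsafe path in archive: " + member.name)
        archive.extractall(data_root)
    return data_root
```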

4. Start Cassandra container and repair.

Finally, we start a new Cassandra container, mapping the data directory into the proper place. We then exec into the container and run nodetool repair:

docker exec cassandra /opt/apache-cassandra-2.1.2/bin/nodetool repair

The node-admin command is very similar to the backup command.

node-admin restore --backup_dir /opt/cassandra/backup/test-backup.tgz

The only thing it requires is the path to the tgz you need to restore.

Other Considerations

As you can probably tell, the state of the Cassandra container is important during the backup and restore of data. To handle this and make things as easy as possible for the user, we use Python context managers. If the user tries to back up while a Cassandra node is not running, we start the node, perform the backup, and shut the node back down. If the user tries to restore Cassandra while it is already running, we stop the node, perform the restore, and start the node again.
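The pattern can be sketched with contextlib. The node object here is a hypothetical wrapper exposing is_running(), start(), and stop(); in node-admin these calls would presumably be backed by docker-py, but that interface is an assumption made for this example.

```python
from contextlib import contextmanager


@contextmanager
def running(node):
    """Ensure the node is running inside the block, then put it back
    into the state it was in before."""
    was_running = node.is_running()
    if not was_running:
        node.start()
    try:
        yield node
    finally:
        if not was_running:
            node.stop()


@contextmanager
def stopped(node):
    """Ensure the node is stopped inside the block, then put it back
    into the state it was in before."""
    was_running = node.is_running()
    if was_running:
        node.stop()
    try:
        yield node
    finally:
        if was_running:
            node.start()
```

A backup would then run inside with running(node): and a restore inside with stopped(node):, and in both cases the try/finally returns the node to its previous state even if the operation raises.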

In conclusion, our goal is to make the backup and restore utilities as seamless and easy for the customer to use as possible. Backup and restore is just one consideration when shipping Cassandra to a customer, but it is the kind of thing we have to think about when developing our platform. Expect more blog posts about Cassandra orchestration and deployment in the future.

About ADTRAN

ADTRAN, Inc. is a leading global provider of networking and communications equipment. ADTRAN’s products enable voice, data, video and Internet communications across a variety of network infrastructures. ADTRAN solutions are currently in use by service providers, private enterprises, government organizations, and millions of individual users worldwide. For more information, please visit www.adtran.com.