CodeReaper

The dream of multi-threaded rsync

2014

As a former maintainer of a rather large datastore, I know the problems of keeping such a store backed up.

The common approach is to use rsync on a regular basis to keep your backup up to date. If you have a few million files
to keep backups of, then each rsync session will take much longer than it did when you only had a few thousand files.

Keeping the backup window as small as possible is key when you want to limit the loss of data when disaster strikes.

If you are like me, you will have found through trial and error that multiple rsync sessions, each taking a specific range
of files, complete much faster. Rsync is not multithreaded, but for the longest time I sure wished it was.

An idea was born

I was reading about shell programming somewhere online and found the missing tool I needed to make rsync "threaded".
The missing tool was wait, which waits for the shell's forked background processes to complete.
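In shell terms, wait blocks until every background job the current shell has started has finished. A minimal illustration of the fork-and-wait pattern:

```shell
#!/bin/sh
# Launch two background jobs; the shell forks one process per "&".
for n in 2 1; do
    sleep "$n" &
done

# wait blocks until every background job of this shell has exited.
wait
echo "all jobs done"
```

The echo only runs once both sleeps are done, which is exactly the property megasync relies on.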

The idea is to create a bunch of forked processes to act as threads for the backup process. There are a few prerequisites
for the way I have chosen to implement what I dubbed megasync. They are:

The primary server and backup server must be running a Linux or Unix system.

A few commands and paths might need to be changed if your primary server is not running FreeBSD.

The user running megasync is set up with passwordless ssh access to the backup server.

The files should be divided into many directories with a similar number of files in each.

The directories containing the files must have a single shared parent directory.

There should only be files in directories at a certain depth in the directory structure.

Putting theory into action

The lazy reader may skip the theory and explanations and go directly to the megasync.sh file.

To put the theory into actual code, I will make a few assumptions for the purposes of explaining:

The data is located in a directory named /data/ on both the primary and backup server.

The data is divided using a pattern such as /data/<department>/<client id>/<short hash>/<short hash>/.

You feel that 6 threads is the right number for you.

Given the prerequisites and the assumptions, the execution plan is as follows:

List every directory in /data/ with a depth of 4.

Divide the list into 6 equal parts.

Fork and wait for 6 processes that create all the destination directories on the backup server.

Fork and wait for 6 processes that rsync each of the directories to the backup server.

Step 1 - listing

Use the find command with parameters defining a maxdepth of 4 and a type of directory. This will give a list of directories,
but it will include paths to directories that are just one, two, and three levels in. We can fix this by using a regex grep.
So the commands that will create the basis for megasync are:
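A sketch of what those commands might look like, assuming the /data/ layout described above; the output filename /tmp/megasync.fulllist is a placeholder of my choosing:

```shell
#!/bin/sh
# List directories up to four levels below /data/. find alone also
# returns the one-, two- and three-level paths, so grep keeps only
# paths with exactly four components after /data/.
find /data -maxdepth 4 -type d \
    | grep -E '^/data/[^/]+/[^/]+/[^/]+/[^/]+$' \
    > /tmp/megasync.fulllist
```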

This is why it is important that there should only be files in directories at a certain depth in the directory structure.

Step 2 - dividing

Now that we have a list of the directories that need to be backed up, we need to divide it into six equal parts.
Naturally, we create a convoluted while loop that writes number-named files with an extension of .dirlist.
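One way to sketch that division, assuming the full list from step 1 sits in /tmp/megasync.fulllist (a placeholder name): deal the lines out round-robin so the six parts end up near-equal in size.

```shell
#!/bin/sh
# Deal the directory list out round-robin into 1.dirlist .. 6.dirlist,
# so each of the six parts ends up roughly the same size.
THREADS=6
i=1
while read -r dir; do
    echo "$dir" >> "/tmp/$i.dirlist"
    i=$((i % THREADS + 1))
done < /tmp/megasync.fulllist
```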

Step 3 & 4 - forking

Before we can back up using rsync this way, we need to ensure the destination directories exist on the backup server.
Really it is simply a mkdir command for each directory, but let us do it threaded anyway. We fork a while loop that creates the
actual directories from inside a for loop, and then use the wait command to make sure all the directories are created before continuing. The wait command is just awesome.
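Putting the pieces together, a sketch of steps 3 and 4, assuming the part files from step 2 and passwordless ssh as described in the prerequisites; the hostname backup is a placeholder:

```shell
#!/bin/sh
THREADS=6
BACKUP=backup   # placeholder hostname for the backup server

# Step 3: create the destination directories, one forked process
# per part file.
for i in $(seq 1 "$THREADS"); do
    while read -r dir; do
        ssh "$BACKUP" mkdir -p "$dir"
    done < "/tmp/$i.dirlist" &
done
wait    # every directory exists before any rsync starts

# Step 4: rsync each part in its own forked process.
for i in $(seq 1 "$THREADS"); do
    while read -r dir; do
        rsync -a "$dir/" "$BACKUP:$dir/"
    done < "/tmp/$i.dirlist" &
done
wait    # the backup run is complete once every part has finished
```

The two wait calls are what turn six independent forks into an orderly two-phase run: no rsync starts until all mkdirs are done, and the script only exits when every transfer has completed.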