File Synchronization with Unison

Keeping the files on multiple machines synchronized seems to be a recurring
problem for many computer users. Until I discovered Unison
(http://www.cis.upenn.edu/~bcpierce/unison/), I never really had a completely
satisfactory solution.

What we'd like to be able to do is efficiently keep two or more servers
completely synchronized with each other no matter what gets changed on any of
the servers. In the simplest case, we have a production server and a backup
server that we need to keep in sync. We might have a cluster of servers used
in a load balancing configuration. In the worst case, we might have a group
of computers where changes are occurring on any or all of the devices.
Consider the case where we have a computer at the office, a laptop, and a
work computer at home. We want to be able to work from any computer at any
time.

One solution is to simply use scp (http://www.openssh.com/) to copy the files
from one computer to the other or others. This solution requires that we
designate one computer to be the “master” and only changes that occur on the
master computer are propagated to the other, slave, computers. Besides a lack
of flexibility, this solution has one serious drawback: it copies every file
from the master to each slave computer every time the synchronization
process is started. On a slow network link, or with a large directory
structure, this often proves untenable.

A slightly better solution is to use rsync (http://samba.anu.edu.au/rsync/).
The rsync program only transfers those files that are different. In fact,
rsync only transfers those parts of a given file that are different. This
mechanism is quite efficient, but it still suffers from the same master/slave
architecture as scp.
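
For comparison, the two one-way approaches described so far might look like
the following. This is only a sketch: the directory and the 10.0.1.56 address
are borrowed from the example used later in this article, and the variable
and function names are made up for illustration.

```shell
#!/bin/sh
# One-way copies from a "master" to one other machine.
# SRC and DEST_HOST are illustrative values, not part of either tool.
SRC=/home/mdiehl/Development
DEST_HOST=10.0.1.56

# scp: recopies every file under SRC on every run.
push_with_scp() {
    scp -r "$SRC" "$DEST_HOST:$(dirname "$SRC")/"
}

# rsync: sends only the differences, but still in one direction only.
# -a preserves permissions and times, -z compresses data in transit.
push_with_rsync() {
    rsync -az "$SRC/" "$DEST_HOST:$SRC/"
}
```

Neither function notices changes made on the destination machine, which is
the limitation the rest of this article is about.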

There are solutions that depend upon kernel services, such as FAM
(http://oss.sgi.com/projects/fam/faq.html), or upon clustered filesystems like
Coda (http://coda.cs.cmu.edu/doc/html/index.html). These solutions can
require kernel support or even a kernel recompilation, which seems like a lot
of work simply to keep a couple of servers synchronized.

So far, unison is the simplest and most effective solution I've found. Unison
will correctly synchronize two servers even if changes occur on both servers.
If a change occurs in the same file on both servers, this causes a conflict,
and unison will display an error message. File content as well as permissions
and ownership can be synchronized. Unison even allows you to keep Linux
machines and Windows machines in sync. For those of you who have slow network
links, it's nice to know that unison works like rsync in that it only
transfers those parts of a file that have been changed, when possible.

Installing unison is trivial. The package management system in most Linux
distributions can automatically install unison for you. Otherwise, simply
download the source and compile it. You will need OCaml installed, though.

Unison can be configured to use a native network protocol, or to use OpenSSH
in order to transfer files. The native protocol is neither authenticated nor
encrypted, so it isn't nearly as secure as the ssh configuration. I recommend
using the ssh configuration, and that's the configuration my example will use.
For automated synchronization, you will probably want to set up key-based
(public key) authentication for ssh so that no password prompt is needed.
There are many easy-to-follow instructions on the Internet that describe how
to set this up, so I won't cover that here.

Once you have unison installed, and ssh configured, it's time to start
synchronizing! But first, we should discuss, briefly, how unison works,
especially the first time it is run against a particular file repository. The
first time you use unison on a file repository, the program makes a note of
the modification timestamp, permissions, ownership, and i-node number for each
file in both repositories. Then, based on this information, it decides which
files need to be updated. The program stores all of this information in the
~/.unison directory. The next time unison is run on the file repository,
changes are trivial to detect. Intuitively, you might expect that unison is
examining the file's contents to see if the file has changed, but that isn't
what is happening. If a file's modification timestamp or i-node number
changes, the file needs to be updated. This is a very fast check and
scales well, even on very large files.
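
The idea of detecting changes from file metadata rather than file contents
can be sketched in a few lines of shell. This is not Unison's actual code,
only an illustration of the principle; it assumes GNU stat (for the -c
option), and the function names are invented for the example.

```shell
#!/bin/sh
# fingerprint FILE: print "mtime inode" for FILE (GNU stat assumed).
fingerprint() {
    stat -c '%Y %i' "$1"
}

# has_changed FILE OLD_FINGERPRINT: succeed if FILE no longer matches
# the fingerprint that was recorded on an earlier run.
has_changed() {
    [ "$(fingerprint "$1")" != "$2" ]
}
```

Saving the output of fingerprint after one run and passing it to has_changed
on the next run answers "was this file touched?" without reading a single
byte of the file's contents, which is why the check stays cheap for large
files.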

My example is a single unison command, which should all be on one line. I do
a lot of software development, and in this example I'm using unison to
synchronize the development directory on my Internet-accessible server with
my workstation on my private network. The example is fairly intuitive, and
real-world usage doesn't get much more complicated than this, so let's take
a closer look.
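
Based on the paths, address, and options discussed in the paragraphs that
follow, the command would look something like the sketch below. Treat it as a
reconstruction rather than the exact original: the wrapper function name is
invented, and the backslashes merely let one logical command line span
several lines of text.

```shell
#!/bin/sh
# Two roots (the local directory first, then the same directory on the
# workstation, reached over ssh), followed by the four options the
# article describes. The double slash after the host marks an absolute
# path on the remote machine.
sync_development() {
    unison /home/mdiehl/Development \
        ssh://10.0.1.56//home/mdiehl/Development \
        -owner -group -batch -terse
}
```

Wrapping the command in a function is only a convenience for calling it from
a cron script; the unison invocation itself is the part that matters.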

The example synchronizes /home/mdiehl/Development on my server to the same
directory on my workstation, whose IP address is 10.0.1.56. The ssh protocol
is used for the file comparison and transfer. Since this is a bi-directional
process, it doesn't matter where the script runs as long as the two machines
can reach each other over the network; it's just more convenient to run my
scripts on the server, but I could just as easily run this script from my
workstation if I change the IP address in the script.

The “-owner” and “-group” parameters tell unison to attempt to synchronize the
user and group ownership. You need to make sure that the owners and groups
exist on all of the machines you intend to synchronize. For example, if you
are syncing a directory owned by the user “bob,” whose uid is 500, you need
to be sure that “bob” exists on every server. Otherwise, you will find that
unison will create an entire directory structure owned by uid 500. This is
messy, but easily resolved.
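
One way to catch this before the first sync is to compare the numeric uid on
both machines. The sketch below assumes ssh access to the other machine, and
the helper function names are purely illustrative.

```shell
#!/bin/sh
# local_uid USER: print USER's numeric uid on this machine.
local_uid() {
    id -u "$1" 2>/dev/null
}

# remote_uid HOST USER: print USER's numeric uid on HOST, via ssh.
remote_uid() {
    ssh "$1" id -u "$2" 2>/dev/null
}

# same_uid HOST USER: succeed only if USER exists on both machines
# and has the same uid on each.
same_uid() {
    l=$(local_uid "$2") || return 1
    r=$(remote_uid "$1" "$2") || return 1
    [ -n "$l" ] && [ "$l" = "$r" ]
}
```

For the example above, running same_uid 10.0.1.56 bob before enabling
“-owner” would show whether “bob” is missing or numbered differently on the
workstation. The same approach works for groups with id -g.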

Since I run this example command from cron, I use the “-batch” parameter,
which tells unison not to ask the user any questions; it simply propagates
what it can and skips any files with conflicting changes. Similarly, the
“-terse” parameter keeps unison from filling up my cron log with a bunch of
unnecessary output.

When I run the example above, I am presented with a list of updates that are
being made between the two computers. The final lines are the most important,
though: they tally how many items were transferred, skipped, and failed.

In my run, 8 files needed to be transferred in order to synchronize the two
servers. Fortunately, there were no problems; all 8 files were transferred,
and my two machines are back in sync. If there were files with
conflicting changes, then we would see that in the “skipped” tally. If there
had been file permissions or network problems, those would have shown up as
failures. Either way, we'd want to go back through the log to find out what
happened.
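
Because nobody is watching a cron job, it can help to check unison's exit
status as well as the log. According to the Unison manual, the status is 0
when everything synchronized, 1 when some files were skipped, 2 when
non-fatal failures occurred, and 3 for a fatal error or an interrupted run;
it is worth confirming this against the manual for your version. The
function name and the message wording below are invented for the example.

```shell
#!/bin/sh
# describe_status CODE: turn a unison exit status into a log message.
describe_status() {
    case "$1" in
        0) echo "in sync: nothing skipped, nothing failed" ;;
        1) echo "some files were skipped (check for conflicts)" ;;
        2) echo "some transfers failed (check permissions and network)" ;;
        *) echo "fatal error or interrupted run" ;;
    esac
}
```

In a cron script, calling describe_status "$?" on the line immediately after
the unison command gives a one-line verdict that is easy to spot in the log.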

In the several years that I've been using unison, I've only had a few problems
with it. As mentioned earlier, the most common problem stems from having
conflicting file changes. For example, if you make a change to a file on one
server and then change the corresponding file on the other server and the
files don't end up being identical, unison sees that as a conflicting change
and flags it. The way I usually resolve this problem is by deciding which
version I want to keep and using the “-prefer” option to tell unison which
version it should... prefer... when there is a conflict. In the example
above, if I wanted to have the local version overwrite the remote version, I
would add:

-prefer /home/mdiehl/Development

to the end of the command line.

The very first problem I had with unison was when I tried to synchronize two
directories that had several tens of thousands of files in them. Unison
simply ran out of memory. If I had one complaint about unison, it would be
that I have to break large file repositories into smaller pieces in order to
use unison to synchronize them. It doesn't seem to me that it should take
that much memory to do the bookkeeping, but I can't argue with the fact that
the tool works and I've never lost a file with it.
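
One way to do that splitting is to synchronize each top-level entry of the
directory in its own unison run, using the “-path” option to restrict each
run to one piece. The sketch below reuses the roots from the example above;
the variable and function names are invented, whether this saves enough
memory will depend on your data and unison version, and the approach has a
limitation that is noted in the comments.

```shell
#!/bin/sh
LOCAL_ROOT=/home/mdiehl/Development
REMOTE_ROOT=ssh://10.0.1.56//home/mdiehl/Development

# list_pieces DIR: print the name of each top-level entry in DIR, one
# per line. Dotfiles and names containing newlines are not handled.
list_pieces() {
    for entry in "$1"/*; do
        if [ -e "$entry" ]; then
            basename "$entry"
        fi
    done
}

# sync_in_pieces: one unison run per top-level entry, so that each run
# only has to keep track of part of the tree.
# Limitation: an entry that does not exist locally (new on the remote
# side, or deleted here) is never listed, so this loop will miss it.
sync_in_pieces() {
    list_pieces "$LOCAL_ROOT" | while IFS= read -r piece; do
        # stdin is redirected defensively so that nothing inside the
        # loop can consume the list of pieces being read.
        unison "$LOCAL_ROOT" "$REMOTE_ROOT" -path "$piece" \
            -owner -group -batch -terse < /dev/null
    done
}
```

Because of the limitation above, an occasional full run on a machine with
enough memory is still a good idea.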

The unison website indicates that unison is no longer under active
development. This is unfortunate, but it shouldn't dissuade you from using
and trusting the program. I've found it to be quite mature, and it is still
actively supported via the unison mailing list. I've had a few occasions to
ask for help on the mailing list, and I've found the list to be extremely
helpful.

Unison is a very effective means of synchronizing servers. It can be used in
a “star” topology to keep multiple servers in sync. It can also be used in
a “ring,” or any other topology you might need. The documentation is quite
extensive and well written. I hope you find it as effective and easy to use
as I have.

Mike Diehl is a Linux Administrator for Orion International at Sandia National
Laboratories in Albuquerque, New Mexico. Mike lives with his wife and two
small boys. Mike can be reached via email at: mdiehl@diehlnet.com