December 9, 2008

The script started with a simple, small idea. Some simple task like backing up
a database or running rsync. You produce the script matching your requirements
and throw it up in cron on some reasonable schedule.

Time passes, growth happens, and suddenly your server is croaking because 10
simultaneous rsyncs are happening. The script runtime is now longer than your
interval. Being the smart person you are, you add some kind of synchronization
to prevent multiple instances from running at once, and it might look like
this:

You have your cron job put the output of this script into a logfile so cron
doesn't email you when the lockfile's stuck.

Looks good for now. A while later, you log in and need to do work that requires
this script temporarily not run, so you disable the cron job and kill the
running script. After you finish you work, you enable the cron job again.

Due to your luck, you killed the script while it was in the rsync process,
which meant the 'rm $lock' never ran, which means your cron job isn't running
now and is periodically updating your logfile with "Lockfile exists, aborting."
It's easy to not watch logfiles, so you only notice this when something breaks
that depends on your script. Realizing the edge case you forgot, you add
handling for signals, just above your 'touch' statement:

trap "rm -f $lock; exit" INT TERM EXIT

Now normal termination and signal (safely rebooting, for example) will remove your
lockfile. And there was once again peace among the land ...

... until a power outage causes your server to reboot, interrupting the rsync
and leaving your lockfile around. If you're lucky, your lockfile is in /tmp and
your platform happens to wipe /tmp on boot, clearing your lockfile. If you
aren't lucky, you'll need to fix the bug (you should fix the bug anyway), but how?

The real fix means we'll have to reliably know whether or not a process is
running. Recording the pid isn't totally reliable unless you check the pid's
command arguments, and it doesn't survive some kinds of updates (name change,
etc). A reliable way to do it with the least amount of change is to use
flock(1) for lockfile tracking. The flock(1) tool uses the flock(2) interface
to lock your file. Locks are released when the program holding the lock dies
or unlocks it. A small update to our script will let us use flock instead:

This change allows us to keep all of the locking logic in one small part of the
script, which is a benefit alone. The trick here is that if '$flock' is not
set, we will exec flock with this script and its arguments. The '-w 0' argument
to flock tells it to exit immediately if the lock is already held. This
solution provide locking that expires when the shell script exits under any
conditions (normal, signal, sigkill, power outage).

You could also use something like daemontools for this. If you use daemontools, you'd be better off making a service specific to this script. To have cron start your process only once and let it die, you can use 'svc -o /service/yourservice'

Whatever solution you decide, it's important that all of your periodic scripts will continue running normally if they are interrupted.

Further reading:

flock(2) syscall is available on solaris, freebsd, linux, and probably other platforms

14 comments
:

lockfile-progs requires that a lockfile be touched at least once every 5 minutes, which puts you in a horrible situation if a command (like rsync, here) runs longer than that. You're almost guaranteed to have a situation where lockfile-progs considers your lockfile to be stale when the process is still running.

You're right Paul, but that's exactly the point Jordan was making, that the more commonly used way of locking has logic holes. Your rewritten version still fails in the 'power outage' scenario that the flock version solves.

Thanks Jordan, this was a great tip for an old sysadmin to learn, it's a beautiful technique.

I don't know what package (In debian/ubuntu) contains lockf, I tried searching, but flock(1) comes with util-linux-ng (util-linux on ubuntu) and is generally always installed since that package also provides other basic system tools like fdisk.

Also, freebsd has lockf(1) by default, not flock(1). FreeBSD lockf and util-linux-ng flock(1) are similar tools.

Agreed about lockfile-progs, that tool is a very bad misimplementation of file locks.