clustered cron jobs

We’ve been recently running rest APIs on active-active server pairs (docker containers running on pairs of VMs on separate hosts) with postgres-BDR (multi-master bidirectional replication) for fault-tolerant storage and a pair of fault-tolerant HAProxy instances for incoming request routing. This is a robust setup which provides zero downtime during rolling updates or hardware maintenance or failure.

However, clustering scheduled jobs (i.e. ensuring that scheduled jobs execute exactly once) becomes a problem in this configuration. Multi-master replicated databases avoid a single point of failure but they are not suitable for use with database-based clustered schedulers like Quartz, so we needed to consider other options. There are complex clustered job schedulers, but we wanted to keep it simple and use Linux crond for scheduling. We finally settled on using keepalived to maintain a single master across the cluster and schedule the cron jobs identically on all servers, using a small if_master.sh script to ensure that a crontab entry only runs if the server is currently the keepalived master. The crontab entries then look like:

0 * * * * ./if_master.sh "./command.sh"

if_master.sh checks if the server is currently keepalived master (by looking for the floating ip address)