Basic DNS setup

The primary DNS records are managed at the University of California, but the domain is delegated to nameservers at Amazon, so most DNS changes must be made through the Route 53 system.

Detecting Failure

Failover is based on tests made by Amazon Route 53. There is a health check called "Failover -- Primary Test". This check combines the results of several other health checks to determine whether failover should occur. When this check fails, the Amazon DNS system starts delivering the IP address of the secondary server instead of the primary server.
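On the DNS side, this behavior corresponds to a Route 53 failover record pair tied to the health check. The fragment below is only an illustration of the shape of such records; the health-check ID and IP addresses are placeholders, not the actual production values:

```json
{
  "Comment": "Illustrative failover pair for datadryad.org (ID and IPs are placeholders)",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "datadryad.org.",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "HEALTH-CHECK-ID-PLACEHOLDER",
        "ResourceRecords": [ { "Value": "192.0.2.10" } ]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "datadryad.org.",
        "Type": "A",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [ { "Value": "192.0.2.20" } ]
      }
    }
  ]
}
```

While the health check passes, Route 53 answers with the PRIMARY record; when it fails, answers switch to the SECONDARY record.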

Failover can be manually triggered (e.g. for system upgrades) using a process described on the WG:Emergency page.

What happens during a failure

During a failure of the primary server, all datadryad.org requests go to the secondary server.

Apache is configured on the secondary server to disallow logins or submission of data.

The Dryad pages have login/submission features replaced with messages saying the feature is currently disabled.

If users try to access a login/submission URL directly (e.g., using a link from an integrated journal), a static HTML page is displayed. This page explains that submissions are currently disabled.

Secondary server

Apache mod_rewrite and mod_substitute disable logins to this instance of Dryad.
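The exact directives are not recorded here, but the approach can be sketched roughly as follows; the URL patterns, file paths, and message text are assumptions for illustration, not the actual configuration:

```apache
# Sketch only -- the real rules live in the secondary server's Apache config.
RewriteEngine On

# Send direct requests for login/submission URLs (e.g. links from integrated
# journals) to a static page explaining that submissions are disabled.
RewriteRule ^/login   /static/submissions-disabled.html [R=302,L]
RewriteRule ^/submit  /static/submissions-disabled.html [R=302,L]

# Rewrite login/submission links in served pages to a "disabled" notice.
AddOutputFilterByType SUBSTITUTE text/html
Substitute "s|<a href=\"/login\">Log in</a>|<span>Login is currently disabled</span>|i"
```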

Files are replicated via rsync.

Keeping secondary server in sync

Main rsync of data files

There is a cron job on the secondary server (root account) that performs the rsync every minute: /root/scripts/rsync-from-ncsu.sh
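The script itself is not reproduced here, but the setup amounts to a crontab entry driving an rsync pull from the primary. The hostname, paths, and options below are assumptions for illustration:

```shell
# Crontab entry on the secondary server (root account):
# * * * * * /root/scripts/rsync-from-ncsu.sh

# Illustrative core of rsync-from-ncsu.sh -- host and paths are placeholders.
# -a preserves permissions/timestamps, -z compresses over the wire,
# --delete removes files on the secondary that were removed on the primary.
rsync -az --delete primary.example.org:/opt/dryad/data/ /opt/dryad/data/
```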

Configuration

Other notes

Before this will work, we need to run solr in a separate instance of Tomcat or Jetty, or upgrade to the latest solr. Jetty would probably use less memory.

If we don't want to depend on a third party like MCNC or want more extensive "health" checks, we could set up a virtual machine (or two) at a cloud host such as EC2 and use it for failover. This would allow for more extensive testing of the primary site in order to trigger a failover. I have used this in the past (http://cbonte.github.com/haproxy-dconv/configuration-1.4.html#4-http-check expect) and it can trigger failover based on a string in the HTTP response similar to our current Nagios heath checks. This would also be inexpensive ($50-$100/month) as the virtual machines could be very small such as EC2 micro instances. Large data transfers could go directly to the primary server rather than through the load balancer and thus would not count against any bandwidth quotas.
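For reference, the haproxy mechanism described in the linked documentation looks roughly like this; the backend name, addresses, and expected string are placeholders:

```
backend dryad
    option httpchk GET /
    # Mark a server as down unless the response body contains this string.
    http-check expect string Dryad Digital Repository
    server primary   192.0.2.10:80 check inter 5s fall 3 rise 2
    server secondary 192.0.2.20:80 check backup
```

With the "backup" keyword, haproxy only sends traffic to the secondary when the primary's check fails.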

If we want to stick with MCNC or another failover service that uses HTTP status for health checks, we could set up Nagios health checks of the production site that would shut down Apache and trigger a failover if a certain string is not present on the website.
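A check along those lines could be a small script run from Nagios. This is only a sketch; the sentinel string, URL, and Apache command are assumptions:

```shell
#!/bin/sh
# Sketch of a string-based health check (string and URL are assumptions).

# Succeeds only when the given page body contains the sentinel string.
page_ok() {
    printf '%s' "$1" | grep -q 'Dryad Digital Repository'
}

# The live check is gated behind an environment variable so the function
# can be tested without touching the network or Apache.
if [ -n "${RUN_CHECK:-}" ]; then
    body=$(curl -fsS --max-time 10 http://datadryad.org/ || true)
    if ! page_ok "$body"; then
        # String missing: stop Apache so the failover service's HTTP check
        # sees the primary as down and switches traffic to the secondary.
        apachectl -k stop
    fi
fi
```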

Use two-way database replication. Bucardo supports this, and the basics could be set up fairly easily, but it would require much testing.

Make the failover site read/write. If we control the failover process, we could make the secondary server read/write. Before switching back to the primary, we could sync files and the database from the secondary back to the primary. This would involve some downtime and more complication, but it is doable.
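The switch-back could look roughly like the steps below. The hostnames, paths, and database name are placeholders, and the real procedure would need careful testing before use:

```shell
# Illustrative switch-back procedure (hostnames/paths/db name are placeholders).

# 1. Stop accepting writes on the secondary (the downtime window starts here).
# 2. Push files back to the primary:
rsync -az --delete /opt/dryad/data/ primary.example.org:/opt/dryad/data/
# 3. Copy the database back to the primary:
pg_dump -Fc dryad_repo | ssh primary.example.org 'pg_restore --clean -d dryad_repo'
# 4. Re-enable the primary and let DNS fail back.
```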

SOLR 4 (in alpha as of 8/2012) should handle master-master replication. Currently SOLR can only perform master-slave replication.