My personal notes about using and abusing Linux for business and pleasure. I run Linux on all my computers and look after countless other Linux servers and workstations, both as a hobby and professionally.

That's Linux. The trials and tribulations of using Linux to get stuff done.

I'm doing some Linux High-Availability fail-over testing on a heartbeat+drbd cluster, and need to prove that in the event of the primary node dying, the secondary will take over. I've done the usual orderly fail-over tests with:

hb_takeover all

We have remote KVM, so I can log in to the console remotely. Although these servers are in a data-centre hundreds of miles away in London Docklands, I can safely shut down the network and still work on the machines, using:

service network stop

Never do that on a server that you only have an ssh connection to! It certainly causes the other node to take over, but it's not a very good test, because you end up with both peers thinking they're the primary (although one is blind to the world): the dreaded "split brain". That takes a bit of sorting out, and about 20 minutes to resync the data store.

I really needed a better test than this. Googling around, I found several suggestions, such as "kill heartbeat" or "shut down the primary", but these just cause an orderly take-over because, even when you kill it, heartbeat tells the other node with its dying breath that it is croaking. I needed something more brutal.

But because they are so far away, I can't just "unplug the power" to simulate a sudden failure. More googling revealed the answer: the kernel's sysrq interface in the /proc filesystem. What you do is load the sysrq gun and pull the trigger. BAM! The machine restarts with no further warning. No shutdown scripts are run, and no messages are sent from the dying heartbeat daemon.

There's also no "are you sure?" message, so only do this on a machine you are sure can stand having the power yanked out - and don't hold me responsible for corrupting your hard disk.
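For reference, here is a minimal sketch of the "load and fire" sequence, based on the standard kernel sysrq files rather than on anything specific to heartbeat. The function name hard_reset and the two variables are made up for this example; the paths are variables only so the function can be pointed at scratch files for a dry run.

```shell
#!/bin/sh
# Sketch only: hard_reset is a hypothetical helper name, not a real tool.
# The defaults are the standard kernel sysrq files; override the variables
# with scratch files to rehearse without rebooting anything.
SYSRQ_ENABLE="${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}"
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

hard_reset() {
    # Load the gun: writing 1 enables all sysrq functions.
    echo 1 > "$SYSRQ_ENABLE" || return 1
    # Pull the trigger: 'b' reboots at once, without syncing or unmounting disks.
    echo b > "$SYSRQ_TRIGGER"
}
```

Calling hard_reset as root on the node you want to kill should reset it immediately, so define the function first and only call it when you really mean it.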

<UPDATE>
I just tried this on a Debian Lenny box (kernel 2.6.26-2-amd64) and it didn't work unless I made the second command:

echo b > /proc/sysrq-trigger
</UPDATE>

This will cause the kernel to issue a system reset. The screen will go blank and then eventually you will see the BIOS boot...

Meanwhile, if you've got everything set up right, on the other server, heartbeat will notice that the primary is dead and begin automatic takeover.

When the rebooted machine has finished booting Linux, it will re-appear on the network and the two machines will re-form the cluster. Heartbeat will start up on the freshly booted former primary and connect to the new primary. Then so will drbd, which should promptly begin synchronising the data sets. Once this is complete, you can manually fail back to the original primary again, using:
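Before failing back, it is worth confirming that the resync really has finished. The sketch below is one way to check, assuming the DRBD 8.x /proc/drbd format, in which each resource line carries a ds: (disk state) field that reads UpToDate/UpToDate once both sides agree. The names drbd_in_sync and DRBD_STATUS are invented for this example.

```shell
#!/bin/sh
# Sketch only: drbd_in_sync is a hypothetical helper, and the ds: field
# layout is assumed to match DRBD 8.x output in /proc/drbd.
DRBD_STATUS="${DRBD_STATUS:-/proc/drbd}"

drbd_in_sync() {
    # Fail if no resource lines are present at all.
    grep -q 'ds:' "$DRBD_STATUS" || return 1
    # Fail if any resource reports a disk state other than UpToDate/UpToDate.
    if grep 'ds:' "$DRBD_STATUS" | grep -qv 'ds:UpToDate/UpToDate'; then
        return 1
    fi
    return 0
}
```

Something like "until drbd_in_sync; do sleep 5; done" would then block until every resource reports both disks up to date.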

hb_takeover all

The test I did this morning did the whole takeover process in 22 seconds, which includes about 15 seconds for heartbeat to decide that the primary has died. Sounds pretty good to me.
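If you want to time the takeover yourself, a rough sketch like the following can be run from a third machine straight after pulling the trigger. The helper name seconds_until is made up here, and the probe command is whatever proves your service is back; this is not how the 22 seconds above was measured.

```shell
#!/bin/sh
# Sketch only: seconds_until is a hypothetical helper. It reruns the given
# command once a second until it succeeds, then prints the elapsed whole seconds.
seconds_until() {
    start=$(date +%s)
    until "$@" >/dev/null 2>&1; do
        sleep 1
    done
    echo $(( $(date +%s) - start ))
}
```

For example, "seconds_until ping -c 1 -W 1 192.0.2.10" would report roughly how long the cluster's service address stayed dark, with 192.0.2.10 standing in for your own address. The result is only accurate to within a second or two, because each failed probe plus the sleep takes that long.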