Timeouts are a common design choice or implementation detail in any
computer system, but they are particularly popular in High-Availability
clusters (such as those built with the SUSE Linux High-Availability
Extension and other stacks that are similarly based on corosync and
pacemaker).

They are a seemingly straightforward way to detect faults: if the task
doesn't complete within N seconds, it is considered failed, and recovery
is attempted. (The task could be anything from a network messaging
protocol to a database starting under the cluster's control, an I/O
operation, or a number of other cases.)

However, selecting a good value for the timeout is less straightforward
than it may seem; more often than not, the chosen values are much too
short. This seems to stem from the belief that a fast response to
failures is unconditionally a good thing: the system will perform better
if timeouts are shorter. This is not quite true, though.

To illustrate, assume two scenarios:

First, that the system has failed in such a way that it does not
immediately return a failure to a monitor task, but instead runs
indefinitely unless aborted by the timeout.

Second, that the system is operating fine, but
experiencing a brief
period of stress, where responses are delayed, just to the
edge of the
timeout value.

Now, let us explore the impact of a timeout that is one
second
"too long"; and then, one that is one second "too short".

For the too-long timeout, the failure in the first scenario is detected
one second later, adding one second to the recovery time. In the second
scenario, no timeout occurs, and the system continues as normal.

For the too-short timeout, the first scenario is recovered one second
faster; the second scenario causes an unnecessary recovery, probably
incurring a real service outage in the attempt to restart the
application, or at least a brief period without service!

Another problem arises from how timeouts are often chosen; of course, if
they were obviously too short, administrators would notice immediately,
since their system would never get off the ground at all, but would
immediately start spewing errors. Instead, the timeouts are usually
adequate for the tested scenario (note that you can use the Pacemaker
monitoring tools to look at the actual runtime of operations); if your
test load exceeds the load of your live system, raise your hand - more
often than not, it does not.
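As for looking at the actual runtimes: a sketch of a one-shot crm_mon
invocation that lists the recorded operation history together with its
timing details (short options as found in Pacemaker 1.x):

    # -1: print the status once and exit; -o: include the operation
    # history; -t: add timing details such as the observed execution time.
    crm_mon -1 -o -t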

Under stress or peak load, system response tends to degrade
non-linearly; it will not just slow down by ten percent, but by thirty.
If this scenario gets treated as a failure, the likelihood that the
fail-over system will experience the same level of stress is high;
worse, requests may have queued up, and if - due to the stress, remember
- the system did not shut down cleanly, an application-internal recovery
phase will compound the effect.

Monitoring application performance for load distribution is quite a
different task from monitoring application correctness. The former is
important, and a performance degradation may also imply a violation of
service level agreements; however, initiating recovery through a restart
is unlikely to alleviate the problem. (In a Pacemaker cluster, this
would best be monitored externally and fed into the utilization
constraints of the resources and nodes.)
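A sketch of what feeding such data into the utilization framework could
look like in crm shell syntax (the node and resource names and all the
numbers are purely illustrative; an external monitor would keep the
values up to date):

    node node1 utilization cpu="8" memory="16384"
    primitive big-db ocf:heartbeat:mysql \
        utilization cpu="4" memory="8192"
    property placement-strategy="balanced"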

In summary, a too-short timeout is the worse choice; rather, it is safer
to make hard timeouts generous beyond reasonable doubt. Yes, this will
slow down fail-over and recovery slightly, but at least it will not
trigger them by mistake.
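As a sketch of what "generous" can look like in crm shell syntax (the
resource name is illustrative and the mysql parameters are omitted for
brevity; size the values well above the worst runtime you have actually
observed, not the average):

    primitive my-database ocf:heartbeat:mysql \
        op start timeout="120s" \
        op stop timeout="180s" \
        op monitor interval="30s" timeout="60s"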

It has been a while since I took the chance to blog here; the time has
been pretty packed with shipping SUSE Linux Enterprise 11 Service Pack
1's High-Availability Extension (or SLE HA 11 SP1 for short ;-) and
supporting the first deployments.

It is a good time to look back and review the very
awesome new features that the community developed along with
us, and that we are shipping as Enterprise-ready now.

A feature that I am personally very impressed by is the OCFS2 reflink
feature; basically, OCFS2 cracked the hard nut of cluster-wide
copy-on-write snapshots, which LVM2 has been trying to crack for years.
This allows space-efficient and very fast provisioning of new VMs,
snapshots for backup, cloning from templates, cloning from clones, et
cetera; it really is amazing.
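Creating such a clone is a one-liner; a minimal sketch, assuming an
OCFS2 file system mounted at /srv/ocfs2 and the reflink(1) utility that
ships with the OCFS2 tooling (all paths are illustrative):

    # Create a copy-on-write clone of a VM template; only blocks that are
    # changed afterwards will consume additional space.
    reflink /srv/ocfs2/templates/sles11-base.img /srv/ocfs2/vms/web01.img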

For those of you who prefer a visual, the team from NGN
taped a
video with me being interviewed by Sander at Novell's
BrainShare in Amsterdam; this
is my first video interview ever!

A very common fallacy when setting up High-Availability clusters - be it
on Pacemaker + corosync, Linux-HA, Red Hat Cluster Suite, or others - is
thinking that your setup, despite all the warnings in the documentation
or in the logfiles, does not require node fencing.

What is node fencing?

Fencing is a mechanism by which the "surviving" nodes in the cluster
make sure that the node(s) that have been evicted from the cluster are
truly gone. This is also referred to as node isolation or, in a very
descriptive metaphor, STONITH ("Shoot the other node in the head"). This
mechanism is not just "fire and forget": the cluster software waits for
a positive confirmation from it before proceeding with resource
recovery.

But it has already failed, otherwise it would not
have been evicted, so why would this be necessary, you ask?

The key here is the distinction between
appearances and reality: a complete loss of
communication with a node looks to all other nodes as if the
node has disappeared. Since you, like the obedient
administrator that you are, have configured redundant
network links, the chance for this to happen is really slim,
right? But that is not the only possible cause. In fact, the node might
still be around, just waiting to come out of a kernel hang, or hiding
behind firewall rules, ready to spew a bunch of corrupted data onto your
shared state.

In short, node fencing/isolation/STONITH ensures the
integrity of your shared state by turning a mere, if
justified, suspicion into
confirmed reality.

(Pacemaker clusters also use this mechanism for escalated
error recovery; if Pacemaker has instructed a node to
release a service (by stopping it), but that operation
fails, the service is essentially "stuck" on that node. The
semantics of the "stop" operation mandate that it must not
fail, so this indicates a more fundamental problem on that
node. Hence, the default process then would be to stop all
other resources on that node, move them elsewhere, and fence
the node - rebooting it tends to be rather effective at
stopping anything that might have been stuck. This can be
disabled per-resource if you don't want some low-priority
failure to shift high-priority resources around, though.)
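One way to disable it per resource, sketched in crm shell syntax (the
resource name and agent are illustrative): setting on-fail="block" on
the stop operation tells the cluster to leave that resource alone
instead of escalating to node fencing.

    primitive low-prio-app ocf:heartbeat:Dummy \
        op stop on-fail="block" timeout="60s"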

This is all very technical. So let me tell you a story
with several possible endings to illustrate.

Story time!

Once upon a time, three friends were sitting
huddled around a fire, peacefully eating their cookies. It
was a tough time: the world was out to get them, a zombie
infection was spreading, they couldn't trust anyone outside
their trusted cluster of friends. They were always watchful
and paid attention to each other.

Suddenly, one of the three stops responding to the
conversation they were having. How do you proceed?

My cluster of friends does not require such a crude
mechanism! He'll be careful not to have been infected! If he
stops responding, he will simply be dead! You ignore the
problem, but then your former friend revives, spreads his
infection to your cookie stack, starts clobbering you with a
club to eat your brains, and his howl gives away your
location to all his new friends, who come down on you with
the intent of eating your brains.

You use an unloaded gun to shoot your friend - the trigger responds
reassuringly. Your former friend revives, and it is all about eating
your brains again.

You kindly tap your friend on the shoulder, and
suggest that he please commit suicide. Your former
friend revives, snaps at your tapping hand, and starts
eating your brains.

You speak a pre-agreed upon code word, a tiny
bomb
goes off in the head of your friend, blows his brains
out, and he drops on the spot. The grue does not eat
you. (In fact, the mechanism monitoring his brain probably
has already blown him up, but you speak the code word anyway
to make sure.)

You take that crude, trusty shotgun and blow
his brains out, aiming away from the stack of
cookies. The grue does not eat you.

So what?

In order, we have gone through the "I do not need
STONITH or have disabled it", "I used the null
mechanism intended only for testing", "I used an
ssh-based mechanism", or the recommended "a
poison-pill mechanism with hardware watchdog support" (such
as external/sbd in Pacemaker environments) and the
time-tested "talk to a network power switch, management
board etc to cut the power" methods.
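For reference, the poison-pill variant amounts to very little
configuration; a sketch in crm shell syntax, assuming a small shared
partition has already been initialised for SBD and the sbd daemon is
running on every node (the device path is illustrative):

    primitive stonith-sbd stonith:external/sbd \
        params sbd_device="/dev/disk/by-id/my-shared-sbd-partition"
    property stonith-enabled="true"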

Pacemaker's escalated error recovery could be likened to
your friend telling you that despite his best attempts,
his wound has become infected (and he can't bring himself to
cut off his hand); he bravely gives away his
equipment to you, kneels down, says goodbye, and you blow
his brains out.

Does that drive the point home? How would you like to
survive armageddon? Of course, it is always possible that
you have a secret liking for becoming a zombie, and
crumbling (instead of eating) all your cookies.

For the full cluster functionality with OpenAIS/OCFS2/cLVM2 and an OCFS2
mount on top, you need to configure DLM, O2CB, and cLVM2 clones, a
resource to activate the LVM2 volume group, and a Filesystem resource to
mount the file system. Add in all the dependencies needed, and you end
up with a configuration pretty much like this (shown in CRM shell
syntax, which is already much more concise than the raw XML):
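(What follows is a sketch with illustrative resource, volume group, and
mount point names; adjust the parameters and timeouts to your
environment.)

    primitive dlm ocf:pacemaker:controld \
        op monitor interval="60" timeout="60"
    primitive o2cb ocf:ocfs2:o2cb \
        op monitor interval="60" timeout="60"
    primitive clvm ocf:lvm2:clvmd \
        op monitor interval="60" timeout="60"
    primitive cluster-vg ocf:heartbeat:LVM \
        params volgrpname="vg-cluster" \
        op monitor interval="60" timeout="60"
    primitive ocfs2-fs ocf:heartbeat:Filesystem \
        params device="/dev/vg-cluster/lv-data" directory="/srv/ocfs2" \
            fstype="ocfs2" \
        op monitor interval="20" timeout="40"
    group base-group dlm o2cb clvm cluster-vg ocfs2-fs
    clone base-clone base-group \
        meta interleave="true"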

Today I'd like to briefly introduce a new safety feature
in Pacemaker.

Many times, we have seen customers and users complain that they thought
they had correctly set up their cluster, but then resources were not
started elsewhere when they killed one of the nodes. With OCFS2 or
clvmd, they would even see access to the filesystem on the surviving
nodes block, and processes, including kernel threads, end up in the
dreaded "D" state! Surely this must be a bug in the cluster software.

Usually, it turns out that these scenarios escalate fairly quickly,
because customers tend to test recovery scenarios only shortly before
they want to deploy, or they find out only after they have already
deployed to production. Not a good time for clear thinking.

However, most of these scenarios have a common
misconfiguration: no fencing defined. Now, fencing is
essential to data integrity, in particular with OCFS2, so
the cluster refuses to proceed until fencing has completed;
the blocking behaviour is actually correct. The system would
warn about this at "ERROR" priority in several places.

Yet it became clear that something needed to be done;
people do not like to read their logfiles, it seems.
Inspired by a report by Jo de Baer, I thought it would be
more convenient if the resources did not even start in the
first place if such a gross misconfiguration was detected,
and Andrew
agreed.

The resulting patch
is very short, but effective. Such misconfigurations now
fail early, without causing the impression that the cluster
might actually be working.

This certainly does not prevent all errors; it can't directly detect
whether fencing is configured properly and actually works, which is too
much for a poor policy engine to decide. But we can try to protect some
administrators from themselves.

(As time progresses, we will perhaps add more such low-hanging fruit to
make the cluster "more obvious" to configure. But still, I would hope
that going forward, more administrators would at least try to read and
understand the logs - as you can see from the patch, the message was
already very clear before, and "ERROR:" messages definitely should catch
any administrator's attention.)

We understand it is a work in progress, and the up-to-date DocBook
sources will be made available under the LGPL too in the very near
future in a Mercurial repository, and we hope to turn this into a
community project as well, providing the most complete documentation
coverage for clustering on Linux one day!

Okay, that can happen. Sometimes driver writers have to make guesses
when the vendor is not cooperative or unavailable. So who wrote the
driver?
* (c) Copyright 2007 Hewlett-Packard Development
Company, L.P.