Boris Wedl wrote:
> Alan,
>
> Great work so far! This is maybe a very stupid question
> but what prevents a machine from thinking it is the last one
> working in the cluster if its own heartbeat subsys goes down?
>
> Greetings and thanks in advance
>
> Boris! - boris.wedl at kfunigraz.ac.at
> Thats great it starts with an earthquake...

This isn't a stupid question at all... Since it seems a very good question,
I've taken the liberty of copying the reply to the list - because I may get the
answer wrong :-)
First off - if a machine's heartbeat communication with its peers goes down,
it can VERY EASILY think it's the only computer in town still working. It
behooves any IP takeover (or other recovery) system to not completely believe
the result of a heartbeat failure. Performing a ping or other check on its
peers is well advised before taking over the work of a dozen other computers
:-)
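
As a rough illustration of that second check, here is a minimal Python sketch.
The names (should_take_over, probe) are hypothetical and not part of any
existing heartbeat code; probe stands in for whatever independent test is
available, such as a ping. Note that a silent probe can still mean our own
network interface is down, so this reduces the risk rather than removing it.

```python
from typing import Callable, Iterable


def should_take_over(heartbeat_lost: bool,
                     peers: Iterable[str],
                     probe: Callable[[str], bool]) -> bool:
    """Decide whether to take over, without trusting the heartbeat alone.

    `probe` is a hypothetical second, independent liveness check
    (for example an ICMP ping) that returns True if the peer answers.
    """
    if not heartbeat_lost:
        # Heartbeat is healthy, so there is nothing to recover.
        return False
    # If any peer still answers the independent check, the heartbeat
    # failure is probably local to us and taking over would be wrong.
    return not any(probe(peer) for peer in peers)
```

Passing the probe in as a function keeps the decision logic separate from how
the check is actually performed.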
Second off, you do well if you choose the most reliable communication system
you can get for heartbeat and then think very carefully about its physical
security and topology. That was one of my concerns about the modem
communication arrangement mentioned by someone earlier. Even so, it makes
sense for some kind of IP version of the heartbeat to exist, if only as a
backup and verification.
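
One way to picture a backup heartbeat path is to track each channel
separately and only give up on a peer when every channel has gone quiet. The
sketch below assumes this; the class name, the channel names, and the timeout
are all made up for illustration.

```python
from typing import Dict, Iterable, Optional


class PeerLiveness:
    """Track the last heartbeat seen from one peer on each channel."""

    def __init__(self, channels: Iterable[str], timeout_s: float) -> None:
        self.timeout_s = timeout_s
        # None means no heartbeat has been seen on that channel yet.
        self.last_seen: Dict[str, Optional[float]] = {c: None for c in channels}

    def record(self, channel: str, now: float) -> None:
        """Note that a heartbeat arrived on `channel` at time `now`."""
        self.last_seen[channel] = now

    def is_alive(self, now: float) -> bool:
        """The peer counts as alive if ANY channel was heard recently."""
        return any(
            seen is not None and now - seen <= self.timeout_s
            for seen in self.last_seen.values()
        )
```

For example, a peer heard only over a hypothetical "udp" channel still counts
as alive after the "serial" channel has timed out.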
This is an important topic for the following reason:
What if your network fragments into two working subnets that can't communicate
with each other, and EACH KEEPS WORKING independently? Maybe they've done
database updates on their own half of a set of data, and now one day the link
between the two comes back up -- you're now faced with the not-at-all automatic
merging of the application data back into one piece.
Putting the cluster back together is easy, but putting the application data
back together *can* be a disaster. This highlights the importance of an
application tie-in to the HA subsystem, and how the nature of the application
and how it works affects the requirements on the HA subsystem. The HA
subsystem can't make a badly architected cluster and communications system
reliable.
If a node sees a rapid and radical change in cluster topology, it might be the
smartest thing going for the applications to decide to sit down and do nothing
until they hear from a human being. This can ultimately improve reliability
and availability (from an application perspective) -- strange as it seems.
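
A minimal sketch of that "sit down and wait for a human" policy might look
like the following. The function name, the action strings, and the 50%
threshold are assumptions chosen for illustration; a real policy would depend
on the application.

```python
from typing import Set


def next_action(previous_members: Set[str],
                current_members: Set[str],
                max_lost_fraction: float = 0.5) -> str:
    """Pick an action after a membership change (hypothetical policy).

    Returns "proceed" if no node was lost, "recover" if a small part of
    the cluster vanished, and "hold_for_operator" if the change is so
    radical that a partition is as likely as a real failure.
    """
    if not previous_members:
        return "proceed"
    lost = previous_members - current_members
    # Losing half or more of the cluster at once looks like a network
    # split, so do nothing until a human confirms what happened.
    if len(lost) / len(previous_members) >= max_lost_fraction:
        return "hold_for_operator"
    return "recover" if lost else "proceed"
```

With this threshold, a two-node cluster that loses its only peer also holds
for an operator, which matches the two-subnet case described earlier.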
What I've designed (and implemented) is only the beginning -- the rudiments of
the core part of an HA subsystem. It gives us a framework for asking and
answering (and pulling our hair out over) the interesting questions (like the
one you asked).
-- Alan Robertson
alanr at bell-labs.com