StaQWare: High Availability for Cobalt RaQ3i Servers - page 3

Extending the Cobalt RaQ3i Server Clusters

September 19, 2000

By Lisa Phifer

Active and standby RaQ3i's try to maintain data synchronization all the
time, so that the standby can jump in immediately and replace the active
without loss of data.

Unfortunately, when a failed or disabled unit returns
to service, synchronization is lost. If just one block of data differs,
the standby must be completely resynchronized with the active.
In our tests, a full resynchronization of 20.4 GB disks typically took
90 minutes over a 100 Mbps cross-connect. When we replaced this
cross-connect with a 10 Mbps hub, the resynchronization time quadrupled.
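
To put those timings in perspective, the following rough calculation converts them into average throughput. It is only a sketch: it assumes decimal gigabytes, that the whole 20.4 GB is transferred, and that "quadrupled" means exactly four times 90 minutes. The names are illustrative and not part of StaQWare.

```python
# Back-of-the-envelope throughput from the resynchronization times quoted above.
DISK_BYTES = 20.4e9                  # 20.4 GB, assuming decimal gigabytes
FAST_SYNC_MIN = 90                   # measured over the 100 Mbps cross-connect
SLOW_SYNC_MIN = FAST_SYNC_MIN * 4    # "quadrupled" over the 10 Mbps hub (assumed exact)


def effective_mbps(num_bytes: float, minutes: float) -> float:
    """Average throughput in megabits per second for a transfer."""
    return num_bytes * 8 / (minutes * 60) / 1e6


fast = effective_mbps(DISK_BYTES, FAST_SYNC_MIN)
slow = effective_mbps(DISK_BYTES, SLOW_SYNC_MIN)

print(f"100 Mbps cross-connect: {fast:.1f} Mbps average over {FAST_SYNC_MIN} minutes")
print(f"10 Mbps hub:            {slow:.1f} Mbps average over {SLOW_SYNC_MIN} minutes")
```

Under these assumptions the cross-connect averages roughly 30 Mbps, well below line rate, while the hub runs at roughly 7.6 Mbps, close to what a 10 Mbps link can deliver. That would suggest the network is the bottleneck only in the hub case.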

The good news: the active RaQ3i continues to serve requests during resynchronization,
and we encountered no data corruption or loss when we interrupted this
process -- it simply restarted again from scratch. Web transactions and
mail transfers completed before failover were mirrored to the standby
and available after failover. The bad news: during resynchronization,
HA services are (naturally) unavailable, leaving the active vulnerable
to failure.

Failover -- When, Why, and How?

Failover from active to standby is initiated automatically when
StaQWare detects a problem (left). What kinds of problems
can StaQWare detect, and how quickly can service be restored?

If the standby cannot reach the active over Network 1, or failover
is initiated manually, the active is shut down, the standby takes
over the active's address, and a mail notification is sent to the
administrator.

We found web, FTP, DNS and mail service restored within just 1-2 minutes,
using default tolerance settings. The failed active must then be manually
powered off and on, so HA services remain unavailable until someone
physically touches the box to restart it and resynchronization completes.

If the active cannot reach the standby over Network 1, HA services are
unavailable but real-time synchronization continues over Network 2. When
Network 1 reachability is restored, HA services resume immediately, without
further resynchronization.

If the active and standby cannot reach each other over Network 2, the
active continues to service clients over Network 1 but drops into a non-HA
state. When Network 2 reachability is restored, a full resynchronization
is required before HA services are available.

If the active and standby cannot reach the default gateway, there is
no point in initiating failover, because both servers are configured with
the same gateway. Thus, StaQWare cannot be used to increase availability
through diverse routing. StaQWare indicates the problem, but does so in
a confusing manner. The standby LCD displays "RaQ is not in HA *check
cabling*" -- nearly always sound advice. But GUI status and mail notifications
indicate "Standby RaQ has failed or been disconnected". One cannot determine
the real problem without actually looking at the standby's LCD panel.
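
The network failure cases above can be summarized as a small decision function. This is a sketch of the behavior we observed, not StaQWare's actual logic: the field names, the result labels, and above all the order in which overlapping conditions are checked are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reachability:
    """Hypothetical snapshot of what each unit can reach."""
    standby_sees_active_net1: bool = True
    active_sees_standby_net1: bool = True
    net2_link_up: bool = True
    gateway_reachable: bool = True
    manual_failover: bool = False


def observed_behavior(r: Reachability) -> str:
    """Map a reachability snapshot to the outcome described in the text.

    The precedence of the checks is an assumption; the article only
    describes each failure in isolation.
    """
    if r.manual_failover:
        return "failover"                  # active shut down, standby takes address
    if not r.gateway_reachable:
        return "no-failover"               # both units share the same gateway
    if not r.net2_link_up:
        return "non-ha-full-resync"        # active keeps serving; full resync later
    if not r.standby_sees_active_net1:
        return "failover"                  # standby takes over the active's address
    if not r.active_sees_standby_net1:
        return "non-ha-sync-continues"     # mirroring continues over Network 2
    return "ha"


print(observed_behavior(Reachability()))
print(observed_behavior(Reachability(standby_sees_active_net1=False)))
print(observed_behavior(Reachability(gateway_reachable=False)))
```

Writing the cases out this way makes the asymmetry visible: a loss of Network 1 as seen from the standby triggers failover, while the same loss as seen from the active only suspends HA.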

We tested network failure by yanking Ethernet cables. We also tested
abrupt power failures. When the active loses power, the standby checks
its own file system, then takes over the active's address and restores
service (in our experience, within 6-7 minutes). When the standby loses
power, HA services become unavailable until the standby is restored and
data is completely resynchronized.

What if both units lose power? The active and standby search for each
other upon restart. After 5 minutes, if the standby cannot locate the active,
the standby takes over, restoring service within about 7 minutes. For
faster restoration, the standby's LCD panel can be used to boot in active
mode.
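
The restart behavior amounts to a search with a timeout. The sketch below illustrates that pattern only; the function name, the polling interval, and the injected clock are hypothetical, and the only figure taken from our testing is the 5-minute window.

```python
from typing import Callable

SEARCH_TIMEOUT_S = 5 * 60   # the 5-minute search window reported above


def standby_startup(active_found: Callable[[], bool],
                    now: Callable[[], float],
                    sleep: Callable[[float], None],
                    poll_interval_s: float = 10.0) -> str:
    """Return the role the standby ends up in after a restart.

    active_found, now and sleep are injected so the logic can be
    exercised without real hardware or real waiting. The polling
    interval is an illustrative assumption.
    """
    deadline = now() + SEARCH_TIMEOUT_S
    while now() < deadline:
        if active_found():
            return "standby"        # the active is back; stay in standby role
        sleep(poll_interval_s)
    return "active"                 # no active located in time; take over


# Simulated run with a fake clock: the active never reappears.
clock = [0.0]
role = standby_startup(
    active_found=lambda: False,
    now=lambda: clock[0],
    sleep=lambda s: clock.__setitem__(0, clock[0] + s),
)
print(role, "after", clock[0], "simulated seconds")
```

Booting the standby in active mode from its LCD panel corresponds to skipping this wait entirely, which is why it restores service faster.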