
Concepts

To reliably verify the expected behaviour of our system during various events, we must be able to simulate all of those events, and we must be able to do so reproducibly. This implies that events need to be either

generated randomly (starting from a given random seed), or

described manually as sequences (to reproduce very specific scenarios).
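As a minimal sketch of what "reproducible" means for the two kinds of event input, the following Python fragment draws events from a privately seeded generator and uses the same event format for a scripted sequence. The component names and the (component, action) tuple format are assumptions made for illustration only; they are not taken from cts or cth.

```python
import random

# Hypothetical component names; a real lab would derive them from its configuration.
COMPONENTS = ["node1", "node2", "link-eth1", "disk-sdb"]
ACTIONS = ["fail", "heal"]


def random_events(seed, count):
    """Return `count` (component, action) events drawn from a seeded generator."""
    # A private generator keeps the sequence independent of other users of `random`.
    rng = random.Random(seed)
    return [(rng.choice(COMPONENTS), rng.choice(ACTIONS)) for _ in range(count)]


# A manually described sequence uses the same event format, so one driver
# can replay both random and scripted scenarios.
SCRIPTED_EVENTS = [("link-eth1", "fail"), ("node1", "fail"), ("link-eth1", "heal")]

if __name__ == "__main__":
    # Two runs with the same seed yield the same event sequence.
    print(random_events(seed=42, count=5) == random_events(seed=42, count=5))
```

Recording the seed together with the test results is then enough to replay a failing random run.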

Ideally the test system is sufficiently general to be applicable to another cluster management system, or even to a non-managed system, without too much modification. This is achieved by abstracting the involved components, and then controlling the underlying actual objects by means of hardware (relays, power switches, whatever; only useful if you really have the resources of some large-scale testing lab) or software (preferable, if at all possible, since the tests can then be run virtually anywhere).

I apologize that the presented hardware abstractions are Linux-centric. This is just because I know it best, and I guess it is the OS of choice for most deployments that use this project anyway.

We expect to have a TestingEnvironment, where one non-cluster machine runs the test and is in full and exclusive control of the set of cluster nodes to be tested, and possibly of some other machines as well (which could represent ping nodes or clients, or whatever we may come up with later). Let's call this controlling and monitoring box (or, equivalently, the controlling software) the Exerciser. Obviously "full and exclusive control" implies that the Exerciser can issue arbitrary commands as root on all nodes.

Mini Howtos

Audits and Integrity Verification

Does the state of the real world match the state expected by our testing system?

Are all managed resources running?

... on the expected nodes?

... with the expected instance count?

... with all inter-resource dependencies correctly resolved?

In a degraded cluster, are all the higher priority resources running?

Do all managed services properly answer client requests?

Is "replicated shared storage" identical, regardless of access path?

There are at least two basic strategies for audits:

Internal or white-box service/resource audits can be done by simply calling the <RA> status operation (as appropriate for the resource agent class), which basically means executing the resource agent scripts with the status query on every node. With the CRM running, we should also query the CRM on each node for the CIB, and verify that it is identical to the one on the master CRM (DC), and that it matches our own findings from the direct status operations and, of course, our overall expectations.

External or black-box service verification. From the test master (or from a farm of clients controlled by it), issue regular client requests to the cluster services. Trigger faults (see Hardware Failure), verify that the cluster recovers the service, and measure the recovery time and/or the percentage of failed requests. LarsMarowskyBree would really love this feature; in his opinion it is the most important testing capability still missing. These application-level audits could be done by scripts in some app-level-audit and/or simul-client-requests directory, which should be named after the respective resource agent scripts.
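The white-box comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the status results and the CIB copies are assumed to have been collected from the nodes already, and the function names and data shapes (resource name mapped to a set of node names, node name mapped to CIB text) are hypothetical.

```python
import hashlib


def audit_placement(expected, observed):
    """Compare expected and observed resource placement.

    Both arguments map a resource name to the set of nodes it runs on
    (the set size doubles as the instance count). Returns a list of
    discrepancies; an empty list means the audit passed.
    """
    problems = []
    for resource, want in sorted(expected.items()):
        have = observed.get(resource, set())
        if not have:
            problems.append(f"{resource}: not running, expected on {sorted(want)}")
        elif have != want:
            problems.append(f"{resource}: on {sorted(have)}, expected on {sorted(want)}")
    for resource in sorted(set(observed) - set(expected)):
        if observed[resource]:
            problems.append(f"{resource}: on {sorted(observed[resource])}, not expected")
    return problems


def cib_nodes_differing(cib_by_node, dc):
    """Return the nodes whose CIB text differs from the copy held by the DC."""
    def digest(text):
        return hashlib.sha256(text.encode()).hexdigest()

    reference = digest(cib_by_node[dc])
    return sorted(node for node, text in cib_by_node.items() if digest(text) != reference)
```

An audit run would call both functions after every triggered event and treat any non-empty result as a failure of that test iteration.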

Hardware Failure

Every abstracted hardware object can take two events: fail and heal. Since hardware failures are "force majeure", a fail can happen at any time. Some hardware failures may be "self-healing" under certain circumstances (e.g. a temporary network failure). Most hardware failures in the real world require operator intervention to be resolved. So the Exerciser is free to "decide", after some time, that a certain failure has been resolved by some technician, and then tell our abstraction object to heal again.

Node

Simulating a node failure is relatively easy: halt -nf. This has a drawback: the node would then remain dead. So we would rather reboot it. But here we have to be careful, since the Exerciser wants to decide exactly when a node is "healed" again.

The nodes need to come up in a certain READY state, which makes it possible for the Exerciser to issue commands on them: to set up our various hardware abstraction objects, or to be "healed" now and (re)join the cluster. To the other nodes, however, they should remain "invisible" or DEAD. This basically boils down to not starting the cluster manager after reboot, but only on request from the Exerciser. So we now simulate a node failure by reboot -nf (the -nf means non-flushing, forced, which should basically be identical to hitting the reset button). We could of course use a STONITH device, too, but those are supposedly controlled by our cluster manager, and using them ourselves might interfere with it.

Storage

In Linux, we can abstract block devices with the Linux device mapper. It remaps IO requests to some underlying device(s) following some target mapping scheme. Its two simplest targets are "linear" and "error", and it allows the target mapping to be changed at runtime. So a working block device will just be mapped transparently with the linear target. When the Exerciser decides that the device has failed, it will remap it, or parts of it, to the error target. The next IO request on that device will then fail with an IO error, and we expect that this is recognized somewhere and that appropriate action is taken (maybe the node panics, reboots, hangs itself, whatever).

Network Links

In Linux, we can use a special catch-all iptables rule as the first rule in all available tables, and atomically change the target of that rule from ACCEPT to DROP. In case we test on HA firewalls, or iptables is otherwise used internally, we need to use RETURN and DROP instead. And we obviously have to make sure that we don't cut off the Exerciser itself, so that it can revert the change. This needs to be done on all endpoints that would be affected by a real-world link failure (NIC-specific). Failures of NICs (single endpoints) would be simulated by DROPping only the packets that pass through the respective NIC on one specific node. This can easily be extended with a rate-limited drop rule to "randomly" drop packets and see how the cluster communication layer copes with it.
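The storage and network link failures described above could be driven by commands along the following lines. This is a rough sketch, not how cts or cth actually do it: the function names are hypothetical, only the filter table's INPUT and OUTPUT chains are covered, and it is assumed that the Exerciser reaches the node through a different NIC than the one being failed. It is also assumed that the catch-all rules were installed beforehand as rule 1 of each chain (for example with `iptables -I INPUT 1 -i <nic> -j ACCEPT`). The functions only build the command strings, so they can be inspected before being sent to a node.

```python
def dm_remap_commands(name, sectors, backing_device, failed):
    """Commands that switch device-mapper device `name` between linear and error."""
    if failed:
        table = f"0 {sectors} error"
    else:
        table = f"0 {sectors} linear {backing_device} 0"
    # Suspend, load the new table, then resume to activate it.
    return [
        f"dmsetup suspend {name}",
        f"dmsetup load {name} --table '{table}'",
        f"dmsetup resume {name}",
    ]


def link_commands(nic, failed):
    """Commands that flip the pre-installed catch-all rule (rule 1) for `nic`."""
    # Use "RETURN" instead of "ACCEPT" if iptables is also used for real filtering.
    target = "DROP" if failed else "ACCEPT"
    return [
        f"iptables -R INPUT 1 -i {nic} -j {target}",
        f"iptables -R OUTPUT 1 -o {nic} -j {target}",
    ]


if __name__ == "__main__":
    for command in dm_remap_commands("test0", 2097152, "/dev/sdb1", failed=True):
        print(command)
    for command in link_commands("eth1", failed=True):
        print(command)
```

Calling the same functions with failed=False yields the commands for the corresponding heal event.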

Implementations

We currently have the cts (ClusterTestSuite) and the cth (ClusterTestHarness) ... The cts is intended to be run in the presence of some cluster manager, to verify its proper operation. The cth implements its own sort-of cluster manager (very limited), and is intended to stress particular subsystems or cluster resources (e.g. DRBD) with hardware failures and client load in greater detail.

The cts in its current implementation and concept is the established method of QA for this project, and has proven very useful in catching bugs early and in making sure that fixed bugs stay fixed. Therefore it obviously cannot readily be replaced, and all its features must be preserved going forward. All operations and commands of the Exerciser are asynchronous, i.e. when, or whether, they had the desired effect generally has to be verified by some other means. To recognize success or failure of the operations it triggers, the cts looks for certain patterns in some consolidated logfile, respecting some timeout. Once you have got the log consolidation right (and you want to have this anyway), this is a very cool concept and a handy piece of code. Of course, this imposes the requirement that all actions taken by the Exerciser, and every single event, as well as the respective success or failure, can be precisely identified in this log file by a simple one-line regular expression search. Though it is possible to wait for N "patterns" in any order, we must make sure that the individual patterns, and therefore the log messages, are unambiguous.
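The idea of waiting for N patterns in any order, with a timeout, can be sketched as follows. This is an illustration of the technique only, not the cts implementation: the function name, its parameters and the rule that one log line satisfies at most one pending pattern are assumptions of this sketch, and the deadline is only checked when a new line arrives.

```python
import re
import time


def wait_for_patterns(lines, patterns, timeout, clock=time.monotonic):
    """Consume log `lines` until every regex in `patterns` has matched once.

    `lines` is any iterable of log lines, for example a generator tailing
    the consolidated logfile. Patterns may match in any order. Returns the
    patterns that never matched; an empty list means success.
    """
    pending = [re.compile(pattern) for pattern in patterns]
    deadline = clock() + timeout
    for line in lines:
        for regex in pending:
            if regex.search(line):
                # One line satisfies at most one pattern, which is why
                # ambiguous patterns would make the result order-dependent.
                pending.remove(regex)
                break
        if not pending or clock() > deadline:
            break
    return [regex.pattern for regex in pending]
```

A test class would trigger its operation, call such a function with the patterns it expects, and count a failure if anything is returned.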

(LarsMarowskyBree thinks this is a fairly severe restriction...)

The cts is currently "limited" to predetermined "test classes", which are typically called in some "random" order, with a randomly chosen node as argument, in case they need it... If the Test is sufficiently intelligent to recognize that it won't be a good idea to run, it can return immediately by just increasing its skipped count. All tests maintain their specific called, success and failure counts. Currently defined test classes include:

Maybe this should go into the cts page. It is in there anyways, but not as complete as here, and not with my words ...

Flip

If the CM on the given node is up, stop it (leave the cluster gracefully). If the CM is down (but the node is up), start the CM (rejoin the cluster).

Restart

Make the given node leave and then rejoin the cluster. If it has already left the cluster, this will join, leave and join again.

STONITH

Crash the given node.

I'd suggest replacing this with a more general hardware abstraction and failing the node, as described above.

Start One by One

Ignores the node argument. Makes sure that on all nodes, the CM is stopped. Then on each node in turn, start the CM, wait for it to come up, then proceed with the next one.

Simul Start

Ignores the node argument. Stop the CM on all nodes in the cluster, then start it again on all nodes (quasi) simultaneously. This should catch conflicts in early resource acquisition.

Simul Stop

Ignores the node argument. Stop the CM on all nodes in the cluster (quasi) simultaneously.

Cluster shutdown is a complex code path, and one version of heartbeat actually had a bug in this one. So, that's what this test is about...

Standby Test

Put the given node into standby mode, i.e. if the CM is running on that node, leave it running, but tell it to give up all resources.

Fast Detection

Kill the CM and related processes on the given node ungracefully, and measure how long it takes for the failure to be detected by the remaining nodes.

Bandwidth

Determine how much bandwidth heartbeat is consuming. This is currently done by a tcpdump of 102 udp packets on the heartbeat port, and then parsing the dump for the sizes of each packet, and the timestamp of the first and last one.

Split Brain

Create a split-brain condition in heartbeat, and see if it recovers correctly.

Redundant Path

On the given node, try to break the comm channels in order. Stop as soon as the first one "successfully breaks", perform a full cluster wide audit run, and heal the broken comms again.

Again, I suggest replacing this with a more general hardware abstraction and failing the link or endpoint.

DRBD

A currently very limited data integrity test for storage shared by DRBD replication. It currently works only if you have exactly one DRBD device in the cluster, the DRBD peers are the first configured nodes (ok, the cts currently expects a two-node cluster anyway), and you use DRBD 0.6.x with x < 13, since it requires read access to the DRBD devices in Secondary state.

It first tries to wait for any ongoing DRBD resynchronization to finish. Then it brings down the CM on both DRBD peers (to make sure nobody is accessing the device), and computes and compares the md5sums of the DRBD device on each node (accessing the device in Secondary state, which is no longer possible, or at least generates huge amounts of kernel warnings, with DRBD 0.6.13 or 0.7.x). It then starts the CM on both peers again.
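The checksum comparison at the heart of the DRBD test can be sketched like this. It is a minimal illustration, assuming the digest is computed locally on each node (for example by running the first function there) and the results are collected by the Exerciser; the function names are hypothetical and not part of the cts.

```python
import hashlib


def md5_of_device(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file or block device, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as device:
        # Read fixed-size chunks so that large devices never sit in memory whole.
        for chunk in iter(lambda: device.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicas_identical(digests_by_node):
    """True if every node reported the same digest for its copy of the data."""
    return len(set(digests_by_node.values())) == 1
```

The comparison is only meaningful while nothing writes to the device, which is why the test stops the CM on both peers first.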

How we can improve the CTS

The current cts implementation stores several command sequences as name-value pairs within its CM class. I suggest moving them as functions into some bash file, similar to what I did for the cth. Advantages: they can easily be re-used by some other testing system, even by simple bash scripts or interactively (again, see how this is done in cth now).

The CtsLab class should be extended with information about the various hardware components and maybe the topology, so that said abstraction can be done, and one can write test classes that fail/heal some of the hardware components. This can be done similarly to how cth does it now.
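One possible shape for such a lab description is sketched below. The class names HardwareObject and LabTopology, and all their fields, are hypothetical and chosen for illustration; this is not the existing CtsLab interface, and the fail/heal methods only track state where a real implementation would also run the corresponding commands on the nodes.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareObject:
    """One abstracted component: a node, a link or a storage device."""
    name: str
    kind: str  # "node", "link" or "storage"
    healthy: bool = True

    def fail(self):
        self.healthy = False

    def heal(self):
        self.healthy = True


@dataclass
class LabTopology:
    """Hypothetical lab description: components and the node each belongs to."""
    objects: dict = field(default_factory=dict)
    attached_to: dict = field(default_factory=dict)  # component name -> node name

    def add(self, obj, node=None):
        self.objects[obj.name] = obj
        if node is not None:
            self.attached_to[obj.name] = node

    def failed(self):
        """Names of all components currently marked as failed."""
        return sorted(name for name, obj in self.objects.items() if not obj.healthy)

    def components_of(self, node):
        """Names of all components attached to the given node."""
        return sorted(name for name, owner in self.attached_to.items() if owner == node)
```

A test class could then pick any object from the topology, call fail() on it, run the audits, and call heal() afterwards, without knowing what kind of hardware it represents.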

The test classes need to be reviewed for their multi-node (>2) awareness. The audit code needs to be supplemented with CRM-aware audits. Test classes simulating administrative requests should be added. Test classes for simulating client load should be added.

Maybe the cts (the CtsLab) should be configured itself and then push this configuration into the cluster, instead of parsing the config files on the cluster nodes. A similar effect could be achieved, though, by some wrapper script around the cts.

(LarsMarowskyBree thinks this is a very important feature to easily test several different scenarios. Pointing the test harness at a bunch of nodes which are setup for ssh login and have all required software installed should be sufficient; all other scenario configuration should, as far as possible, come from the test scenario description.)

The consolidated logfile should include all relevant data, such as the DRBD state at critical points, et cetera. If necessary, the test agents should go out and gather this information themselves. But in general, it should be totally unnecessary to manually retrieve additional logfiles to pinpoint a problem found by the test harness. If this can easily be enabled in a production cluster, it will also ease in-the-field debugging and support. Call this a meta-test of the sensibility of our logging, if you will.

(LarsMarowskyBree would like it even more if the cluster software internally generated this consolidated logfile without requiring a central syslog server to be configured. Or at least we need to make sure customers set this up correctly and thus have a good howto. Note that standard syslog is lossy and thus not a good idea to use.)

The DRBD test class should be improved...

In the long run, it would be cool to have some sort of GUI in addition to the RandomTests class. It would present an audit/status window and, in some control window, the nodes, comm links and storage devices, as well as the managed services and maybe some simulated clients. Then one could simply click on some component to fail or heal it, add or remove client load, and see the effect in the audit/status window... With some gtk plugin, this seems an affordable effort. Yes, compared to 500,000 automatic test iterations, the QA effect would be negligible. But for debugging specific problems, it would be nice to have. And given its high coolness factor, this gadget would probably be cheerfully used.