
Writing your own OCF Resource Agent mini Howto

Everything in /usr/lib/ocf/resource.d/heartbeat is provided by the resource-agents (or cluster-agents) package, which you should install together with Heartbeat and Pacemaker. When creating your own agents, you are encouraged to create a new directory under /usr/lib/ocf/resource.d/ and use provider={your subdirectory name}. For example, to name your provider dubrouski and create a resource named serge, you would make a directory called /usr/lib/ocf/resource.d/dubrouski and name your resource script /usr/lib/ocf/resource.d/dubrouski/serge.
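The layout described above can be sketched as a small install helper. This is only an illustration: the function names ra_path and install_ra are made up for this example, and the OCF_ROOT default of /usr/lib/ocf is assumed from the paths in the text.

```shell
# Assumed install root; the text above uses /usr/lib/ocf.
OCF_ROOT="${OCF_ROOT:-/usr/lib/ocf}"

# Print the path where an agent for the given provider and resource belongs.
# (hypothetical helper, not part of any package)
ra_path() {
    printf '%s/resource.d/%s/%s\n' "$OCF_ROOT" "$1" "$2"
}

# Create the provider directory and copy the agent script into it.
# Arguments: provider name, resource name, path to your script.
install_ra() {
    provider=$1
    resource=$2
    script=$3
    target=$(ra_path "$provider" "$resource")
    mkdir -p "$(dirname "$target")" &&
    cp "$script" "$target" &&
    chmod 755 "$target"
}
```

With these definitions, running install_ra dubrouski serge ./serge as root would place the script at /usr/lib/ocf/resource.d/dubrouski/serge.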

For convenience, many of the return codes, defaults and other OCF utility functions can be sourced by custom OCF agents from /usr/lib/heartbeat/ocf-shellfuncs.
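A minimal sketch of how an agent might pull in those definitions. The path is the one given above and may differ between package versions, so the example only sources the file if it is readable. The fallback values are the standard OCF return codes; defining them by hand like this is an assumption made so the sketch also runs on a machine without the package.

```shell
# Path as given in the text; adjust it if your package installs elsewhere.
OCF_FUNCS=/usr/lib/heartbeat/ocf-shellfuncs

# Source the shared definitions when they are available.
if [ -r "$OCF_FUNCS" ]; then
    . "$OCF_FUNCS"
fi

# Fallbacks: only assigned if the sourced file did not already set them.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_ERR_ARGS:=2}"
: "${OCF_ERR_UNIMPLEMENTED:=3}"
: "${OCF_ERR_PERM:=4}"
: "${OCF_ERR_INSTALLED:=5}"
: "${OCF_ERR_CONFIGURED:=6}"
: "${OCF_NOT_RUNNING:=7}"
```

Using the symbolic names rather than bare numbers makes it harder to return the wrong code by accident.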

Beware: the Linux-HA implementation has been somewhat extended beyond the OCF specs, but none of those extensions is incompatible with the OCF specification.

When writing or testing your OCF Resource Agent, you may find the ocf-tester script very useful. It comes in the resource-agents package (cluster-agents on Debian-based distros).

Actions

Normal OCF Resource Agents are required to have these actions:

start - start the resource. Exit 0 when the resource is correctly running (i.e. providing the service) and anything else except 7 if it failed.

stop - stop the resource. Exit 0 when the resource is correctly stopped and anything else except 7 if it failed.

monitor - monitor the health of a resource. Exit 0 if the resource is running, 7 if it is stopped and anything else if it has failed. Note that the monitor action should test the state of the resource on the local host.

meta-data - provide information about this resource as an XML snippet. Exit with 0.

Note: OCF specs have strict definitions of what exit codes actions must return. We follow these specifications, and exiting with the wrong exit code will cause the cluster to behave in ways you will likely find puzzling and annoying. In particular, the cluster needs to distinguish a completely stopped resource from one which is in some erroneous and indeterminate state.
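The four required actions and their exit codes can be sketched as follows. This is a toy agent, not a real one: the resource is considered "running" when a state file exists, the names serge_* and SERGE_STATE_FILE are hypothetical, the XML is an abbreviated illustration of the meta-data format rather than a complete description, and returning 3 for an unknown action is the standard "unimplemented" code, which the text above does not mention.

```shell
# Exit codes used by the required actions.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Hypothetical resource: it is "running" when this state file exists.
STATE_FILE="${SERGE_STATE_FILE:-/tmp/serge.state}"

# monitor: 0 if running, 7 if cleanly stopped.
serge_monitor() {
    if [ -f "$STATE_FILE" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

# start: 0 only once the resource is really running, never 7.
serge_start() {
    touch "$STATE_FILE" || return $OCF_ERR_GENERIC
    serge_monitor || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

# stop: 0 once the resource is really stopped, never 7.
serge_stop() {
    rm -f "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

# meta-data: print an XML description and exit 0.
serge_meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<resource-agent name="serge">
  <version>0.1</version>
  <longdesc lang="en">Example agent that tracks a state file.</longdesc>
  <shortdesc lang="en">Example agent</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
    return $OCF_SUCCESS
}

# Map the action name given as the first argument to a function.
serge_dispatch() {
    case "$1" in
        start)     serge_start ;;
        stop)      serge_stop ;;
        monitor)   serge_monitor ;;
        meta-data) serge_meta_data ;;
        *)         return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

# When run as a script with an action argument, exit with that action's code.
if [ $# -gt 0 ]; then
    serge_dispatch "$1"
    exit $?
fi
```

Note how start confirms the result through monitor before reporting success, so the cluster never sees a 0 from a resource that is not actually providing the service.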

OCF Resource Agents should support the following action:

validate-all - validate the set of configuration parameters given in the environment. Exit 0 if the parameters are valid, 2 if they are not valid, 6 if the resource is not configured, and 5 if the software the RA is supposed to run cannot be found.
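One possible shape for validate-all, as a sketch only. The parameters binary and port are hypothetical, and so is the mapping chosen here: a missing required parameter is reported as "not configured" (6), a malformed value as "not valid" (2), and a missing program as "not found" (5).

```shell
# Exit codes listed for validate-all.
OCF_SUCCESS=0
OCF_ERR_ARGS=2
OCF_ERR_INSTALLED=5
OCF_ERR_CONFIGURED=6

serge_validate_all() {
    # Hypothetical parameters: the program to run and an optional port.
    binary="${OCF_RESKEY_binary:-}"
    port="${OCF_RESKEY_port:-}"

    # The required parameter is missing: the resource is not configured.
    if [ -z "$binary" ]; then
        return $OCF_ERR_CONFIGURED
    fi

    # The port, if given, must consist of digits only.
    case "$port" in
        *[!0-9]*) return $OCF_ERR_ARGS ;;
    esac

    # The software the agent is supposed to run must be installed.
    if ! command -v "$binary" >/dev/null 2>&1; then
        return $OCF_ERR_INSTALLED
    fi

    return $OCF_SUCCESS
}
```

Calling the same function at the top of start is a cheap way to fail early with a precise code instead of a generic error.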

Additional requirements (not part of the OCF specs) are placed on agents that will be used for cloned and multi-state resources.

promote - promote the local instance of a resource to the master/primary state. Should exit 0.

demote - demote the local instance of a resource to the slave/secondary state. Should exit 0.

notify - used by the cluster to send the agent pre and post notification events telling the resource what is about to take place or has just taken place. Must exit 0.

reload - reload the configuration (non-unique parameters only) of the resource instance without disrupting the service

migrate_from / migrate_to - perform live migration of a resource

recover - a variant of the start action, this should try to recover a resource locally (currently not used by Pacemaker).

Parameters

Besides the actions above, your OCF resource agent is permitted to take parameters
that tell it which instance of the resource it is being asked to control, and any simple
configuration it might need to know what to do or exactly how to do it.

These are passed to the script as environment variables with the special prefix OCF_RESKEY_.
So a parameter which the user thinks of as ip is passed to the script as OCF_RESKEY_ip.
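A short sketch of how an agent might read such parameters. Only ip comes from the text above. The netmask parameter, its default of 24, and the function name are invented for illustration.

```shell
# Read parameters from the environment, applying a default where one makes sense.
serge_describe_params() {
    ip="${OCF_RESKEY_ip:-}"
    netmask="${OCF_RESKEY_netmask:-24}"   # hypothetical optional parameter

    # ip has no sensible default, so report the problem if it is missing.
    if [ -z "$ip" ]; then
        echo "parameter 'ip' is not set" >&2
        return 1
    fi

    echo "ip=$ip netmask=$netmask"
}
```

To try an agent by hand, you can set the variables yourself on the command line, for example: OCF_RESKEY_ip=192.168.1.10 /usr/lib/ocf/resource.d/dubrouski/serge monitor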

Debugging your OCF Resource Agent

The most common problems when implementing OCF Resource Agents are:

Not implementing the monitor operation at all

Not observing the correct exit status codes for start/stop/monitor actions

Returning an error when asked to start a resource that is already started (this violates the OCF spec)
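The last problem is usually avoided by checking the current state before acting. The sketch below uses a hypothetical resource represented by a lock directory (mkdir fails if it already exists, much as many daemons refuse to start twice). The names serge_* and SERGE_STATE_DIR are made up, and stop is given the same treatment as start on the assumption that stopping an already stopped resource should succeed too.

```shell
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical resource: it is "running" when this directory exists.
STATE_DIR="${SERGE_STATE_DIR:-/tmp/serge.lock}"

serge_monitor() {
    if [ -d "$STATE_DIR" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

serge_start() {
    # Already running: report success instead of an error.
    if serge_monitor; then
        return $OCF_SUCCESS
    fi
    mkdir "$STATE_DIR" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

serge_stop() {
    # Already stopped: nothing to do, which still counts as success.
    if ! serge_monitor; then
        return $OCF_SUCCESS
    fi
    rmdir "$STATE_DIR" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}
```

Without the check in serge_start, the second start would hit the mkdir failure and return 1, which is exactly the mistake listed above.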