CA Automation Point customers want to know that all of their CA Automation Point machines are running properly. They also want to be able to "hot swap" machines in case of a hardware error. You can use a combination of REXX programs, PPQs, and rules to "heartbeat check" between two CA Automation Point machines.

Question:

How do I set up heartbeat checking between two CA Automation Point computers?

Environment:

CA Automation Point r11.4.x and r11.5.x

Answer:

To start, you need to configure program-to-program queues (PPQs) on both CA Automation Point machines. PPQs are an inter-process communications tool. They are small data repositories that can be accessed via TCP/IP between Automation Point machines. In this article, we discuss how to use PPQs to pass an "I'm Alive" message between two Automation Point computers.

To configure PPQs, on each of the two Automation Point machines, do the following:

From the Configuration Manager dialog, go to Expert Interface -> Infrastructure -> Program to Program Queues. The Program to Program Queues dialog displays.

On the Program to Program Queues dialog, check the Enable Use of PPQs box and make sure that TCP/IP is included under Configured Network Transports.

Under TCP/IP Settings, enter the TCP/IP hostname or IP address of the remote CA Automation Point machine with which you want to communicate.

The PPQ Service starts when you close Configuration Manager.

Once PPQs are configured, you need to write a REXX program that attempts to create the queue shared between the two Automation Point computers. The name of the shared queue is the Automation Point machine name.

In this example, we configure the REXX program on each of the Automation Point computers so that it starts as soon as CA Automation Point starts. The first computer to start creates the shared queue. The REXX program would look like this:

/* Hbeat_start.rexx */
/* First we initialize our REXX variables to 0, then try to create */
/* the shared PPQ queue. If the create fails, we send a message */
/* to the AP message window, which can be automated by rules */

Next, set up the rules file to perform the heartbeat checks. In this example, assume a time rule that fires a REXX program every five minutes.

The rule would look like this:

TIME(00:00), EVERY(5 MINUTES)
REXX(HBEAT.REX)

PPQs can be manipulated directly from rules, but REXX programs are much more flexible. The REXX program first writes a "|" to the proper PPQ queue, then reads the proper PPQ queue and sets two variables, remotemachinename_status and remotemachinename_failure. In the program, replace machinename with the local Automation Point machine name and remotemachinename with the remote Automation Point machine name. If the queue is read successfully, the remotemachinename_status variable is set to 1; otherwise it remains 0. The remotemachinename_failure variable counts how many consecutive times the program fails to read the queue. If the remotemachinename_failure variable becomes greater than 3, the program sends a message to the Automation Point messages window.

The REXX program would look like this:

/* HBEAT.REX */

/* We write our heartbeat message to the proper queue. If we do not */
/* get a 0 return code, we do a wtxc, which can have a rule written */
/* against it to do a notification. The user should change */
/* machinename to the name of the remote PC */
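
As a rough illustration of the counting logic described above (a sketch only, not the original HBEAT.REX), the following plain REXX fragment simulates six heartbeat cycles. The list of return codes in results stands in for the real PPQ read, and the SAY instruction stands in for the message sent to the Automation Point messages window. Both are assumptions made so that the fragment runs in any REXX interpreter. The variable names status and failure are shortened, hypothetical versions of remotemachinename_status and remotemachinename_failure.

```rexx
/* hbeat_sketch.rex: hypothetical illustration of the failure counter.  */
/* It does not call any PPQ or Automation Point command.                */

status  = 0                  /* 1 after a successful read, otherwise 0  */
failure = 0                  /* number of consecutive failed reads      */
results = '0 8 8 8 8 0'      /* simulated read return codes, one per run */

do i = 1 to words(results)
  rc_read = word(results, i) /* stand-in for the real PPQ read          */

  if rc_read = 0 then do
    status  = 1              /* remote machine answered                 */
    failure = 0              /* a success ends the run of failures      */
  end
  else do
    status  = 0
    failure = failure + 1    /* one more consecutive miss               */
  end

  /* More than 3 consecutive misses: raise the alert. SAY is used here  */
  /* only as a placeholder for the message to the AP messages window.   */
  if failure > 3 then
    say 'HBEAT: remote machine has missed' failure 'consecutive heartbeats'
end
```

With the simulated return codes shown, the alert is raised once, on the fifth cycle, and the counter resets on the sixth. In a rule-driven program the counter would presumably need to be kept somewhere that survives between invocations, since the time rule starts the program again on every interval. How that state is stored is not shown here.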