ITM Silver Blaze – Agent Responsiveness Checker

By John Alvord

IBM Corporation

Please note. While still interesting, this project has been largely superseded by

Inspiration

I have a great job. People come to me with puzzles and I get paid to investigate. A recent customer had 400+ Solaris systems running Unix OS Agent at ITM 622 FP5 and earlier levels. By chance they identified a single instance of a Unix OS agent that was not running situations. They were naturally worried there could be other cases.

Introduction

In ITM, there are occasionally agents that report online but are not running situations. When real time data is requested the request times out. I call them non-responsive agents and have puzzled for years about how to detect them easily.

If you suspect a non-responsive agent, you can attempt to view real time data and observe the time out condition. That requires expensive manual work for each agent and you can never be sure if things remain good. Once you find a single such case you will worry every day and night. A single situation not firing can be costly.

With this new inspiration, I remembered a famous Conan Doyle short story about Sherlock Holmes titled Silver Blaze. Sherlock resolved a mystery by noting that a watchdog did not bark in the night.

Silver Blaze Overview

There are three components to the Silver Blaze scheme: a situation, a workflow policy and a Perl program. An example implementation is available as a zip file [Right-click/Save As...]. The goal of this example is to identify all non-responsive Linux OS Agents. The files can also be found here: https://github.com/jalvo2014/silverblaze

The situation does not need to run at startup since it is used only by a workflow policy. The KLZ_System_Statistics attribute group is used because it has a System_Name (the agent name) among the attributes.

The Linux/Unix touch command creates a zero length file or updates the modification time of an existing file. See Appendix 1 for the Windows batch file wintouch.bat, which accomplishes the same thing. The Windows example is included in the example files.
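The touch behavior described above can be illustrated with a minimal Python sketch. This is only an illustration of the semantics, not part of the Silver Blaze files; the function name touch and the file name used in the usage note are hypothetical.

```python
from pathlib import Path

def touch(path):
    """Create a zero-length file, or refresh the modification time of an existing one."""
    p = Path(path)
    p.touch(exist_ok=True)    # an existing file keeps its content; only the timestamp changes
    return p.stat().st_mtime  # epoch seconds of the last modification
```

For example, touch("/tmp/host1_LZ") would create the file on the first call and only move its modification time forward on later calls. That is what lets the file time stand in for "the last time this agent delivered a result".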

The example workflow policy has the same distribution as the situation: *LINUX_SYSTEM. This means the policy is active on each TEMS where a Linux OS agent connects. The Take Action options force the command to run on the hub TEMS. The workflow policy correlation is “managed system”.

When the policy is started [or auto_started], the situation is automatically started on each Linux OS Agent. The situation runs in results-only mode and does not create events. Every 15 minutes the agent sends a new result to the TEMS. The workflow processes the result and then runs a command on the hub TEMS.

In this example the /tmp/ directory was used for the touch files. You can of course pick any target directory.

After the situation sends results and the workflow policy runs, the /tmp directory fills with files having the names of Linux OS Agents which are active and processing situations.

d. Print out the names of touch files which are not listed as online agents

e. Print out the names of touch files which are late by some predetermined number of seconds.
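The name comparison behind step d, together with the reverse case of an online agent that has no touch file, can be sketched in Python as follows. This is a hedged illustration and not the logic of itm_unresp.pl itself. It assumes the list of online agent names has already been captured (for example from tacmd output) and that the target directory holds only touch files. The function name compare_names is hypothetical.

```python
import os

def compare_names(online_agents, touch_dir):
    """Return (online agents with no touch file, touch files with no online agent)."""
    # Assumption: every regular file in touch_dir is a touch file named after an agent.
    touched = {name for name in os.listdir(touch_dir)
               if os.path.isfile(os.path.join(touch_dir, name))}
    online = set(online_agents)
    missing_touch = sorted(online - touched)  # online but never delivered a result
    not_online = sorted(touched - online)     # touch file present, agent not online
    return missing_touch, not_online
```

An online agent with no touch file is the "dog that did not bark": it reports online but has never delivered a situation result.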

This example Perl program is configured with user specified values at the beginning which tell how many seconds is considered late, the user/password for tacmd, the target directory for files, etc. The itm_unresp.pl program has been tested on both Linux and Windows.

For Linux/Unix, the location of Perl is specified in the first line of itm_unresp.pl

#!/usr/bin/perl -w

If Perl is in a different location on the system where itm_unresp.pl will run, you will need to change that first line. On Windows the Perl program libraries are present in the PATH after installation and that line is ignored. The "-w" enables certain warnings.

Controls for itm_unresp.pl

These controls are in the beginning of the itm_unresp.pl source. Modify them to match your requirements.

In this test lateness was defined as 30 seconds and the situation sampling interval was set to 60 seconds. This setup deliberately forced lateness messages. The modify value is the epoch seconds when the file was last modified. The late value is the current epoch time minus the lateness seconds defined in itm_unresp.pl.
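The lateness arithmetic can be written out as a short Python sketch. It follows the definitions above (modify is the file's last modified time in epoch seconds, late is the current epoch time minus the lateness seconds), but the function name is_late and the strict less-than comparison are assumptions, not taken from itm_unresp.pl.

```python
import os
import time

def is_late(path, lateness_seconds, now=None):
    """Return True when the touch file was last modified before the late cutoff."""
    now = time.time() if now is None else now  # current epoch seconds
    modify = os.path.getmtime(path)            # epoch seconds of last modification
    late = now - lateness_seconds              # cutoff: anything older is late
    return modify < late
```

With lateness set to 30 seconds and a 60 second sampling interval, a file touched at time T is reported late by any check made after T+30 and before the next touch near T+60, which is why this test setup produced lateness messages. In normal use the lateness value would be chosen larger than the sampling interval.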

The missing touch file message was produced by an option to add a fake online node.

There is a message type “node $node not in online capture” which means there is a touch file present but the agent is not currently online. I suspect that means the agent has gone offline and the touch file should be deleted. That logic is not yet implemented.
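One possible shape for that cleanup, sketched in Python, is shown here. This is purely an illustration of the idea and not something itm_unresp.pl does. The function name remove_stale_touch_files and the dry_run parameter are hypothetical, and the sketch assumes the target directory contains only touch files.

```python
import os

def remove_stale_touch_files(online_agents, touch_dir, dry_run=True):
    """Find touch files whose agent is not online; delete them unless dry_run is set."""
    online = set(online_agents)
    stale = sorted(name for name in os.listdir(touch_dir)
                   if name not in online
                   and os.path.isfile(os.path.join(touch_dir, name)))
    if not dry_run:
        for name in stale:
            os.remove(os.path.join(touch_dir, name))  # agent is offline: drop its file
    return stale
```

Defaulting to a dry run keeps the sketch from deleting anything until the reported list has been reviewed.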

Alternative setups

It might be inconvenient or impossible to run the itm_unresp.pl program on the hub TEMS. If so, pick any system with a Windows/Linux/Unix OS Agent. Change the workflow policy so the touch [or Windows wintouch.bat] command runs on that Agent. Then you can run the itm_unresp.pl summary program on that same system with the same results.

Linux/Unix systems usually come with Perl already installed. If your target is a Windows system, then install Perl from www.activestate.com, which has an excellent free version. The itm_unresp.pl program only uses built in or core facilities.

Outstanding Customer Results

Using the Silver Blaze scheme the customer determined that 167 agents were stalled after roughly an hour of effort. A study of 115 agent operations logs revealed evidence of a defect corrected at ITM 622 FP6 when the TEMA threading logic was reworked. An upgrade to ITM 623 FP2 was already underway and was thereby accelerated. Updating each Unix OS Agent was sufficient to resolve the issue for all ITM agents running on each system. In the meantime, stuck agents were recycled as needed and monitoring continued.

The underlying ITM issues have been resolved over time, but not everyone runs the latest maintenance level. In addition, the problem can be environmental, like a full mount point or some competing process in a loop. [See Appendix 2 for five APAR fix examples.]

Having a centralized facility to identify non-responsive agents will speed resolving such issues. Until the problem can be corrected, early identification and recycling will reduce the time agents spend running in a non-responsive mode.

Summary

This scheme provides a way to view non-responsive agents reliably. It can also be used as a long term checker for these issues. After an initial scan and cleanup, the sampling interval should be changed to once a day or so.

In practical use, you would create a wintouch.bat file based on the one in the example zip file and save it on the Windows system in a known location. In the Workflow Policy take action command, set the fully qualified name of the wintouch.bat command file. The itm_unresp.pl command is aware of the changed form of the Agent name and will make the right tests when run on a Windows system.

Appendix 2: Non-responsive agent APAR fix examples

These are examples of APAR fixes which handled cases where an agent might end up non-responsive. The list is not complete, but these are the ones I remember. These are rarely observed when an agent is running with up to date maintenance. There are also many environmental problems which can have the same result.

First is a case where the Agent Support [TEMA] threading model needed work to avoid a deadlock. It could theoretically happen on any Linux/Unix environment but in practice was only seen on the Solaris Unix OS Agent. Corrected in ITM 622 FP6 and in ITM 623 FP1.