Sitworld: ITM Virtual Table Termite Control Project

Draft #2 – 19 January 2015 - Level 1.02000

Inspiration

I recently worked with a customer who had an unstable TEMS environment. There were 3152 agents and 13 remote TEMSes, and the workload wasn’t high. The instability showed up as remote TEMSes missing heartbeats, which caused them to go offline and was very disruptive. In one 14-hour period the hub TEMS noted 60 missed heartbeats.

One unusual observation in the diagnostic log was that the Listen pipes count was 21, much higher than normal. I had seen that once before and documented it in the blog post "ITM and the 1997 Kasparov vs. Deep Blue Chess Match". In that case there were SP2OS UADVISOR situations that were updating virtual tables at the hub TEMS, and those tables were unused and/or unusable.

I pictured that virtual table update process as termites, silently digesting the building infrastructure until a sudden collapse here and there. The parallel is inexact since the building [TEMS] can be rebuilt [restarted] at any time, but the image felt right.

Diagnosis

Suspecting the SP2OS type, I reviewed the agent types in that environment. There were 200 Unix OS Agents which would fire off virtual table updates every 3 minutes. There were 80 database [mostly MS-SQL] agents that would fire off every 2 minutes. The actual data volume arriving at the hub TEMS wasn’t high, but the impact of many agents sending data in at the same time could destabilize ITM communications. Consider this theoretical timeline of update arrivals.

12:02 – 80 DB vtable updates arrive

12:03 – 200 UX vtable updates arrive

12:04 – 80 DB vtable updates arrive

12:06 – 280 UX/DB vtable updates arrive
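The clustering in that timeline can be reproduced with simple arithmetic. The following Python sketch assumes, purely for illustration, that every agent of a type fires in the same minute and that both cycles are aligned at 12:00. Real agents would drift, so treat this as a worst-case picture rather than a measurement.

```python
from datetime import datetime, timedelta

# Agent counts and update intervals taken from the text above.
AGENT_TYPES = {
    "UX": {"count": 200, "interval_min": 3},
    "DB": {"count": 80, "interval_min": 2},
}


def arrivals(start, minutes):
    """Return (time, total_updates) pairs for each minute with arrivals.

    Assumes all agents of a type fire together and that every cycle is
    aligned to `start`; that is a simplification for illustration.
    """
    result = []
    for offset in range(1, minutes + 1):
        total = sum(
            spec["count"]
            for spec in AGENT_TYPES.values()
            if offset % spec["interval_min"] == 0
        )
        if total:
            result.append((start + timedelta(minutes=offset), total))
    return result


if __name__ == "__main__":
    for when, total in arrivals(datetime(2015, 1, 19, 12, 0), 6):
        print(when.strftime("%H:%M"), total)
```

Every sixth minute both cycles coincide, which is where the burst of 280 simultaneous updates comes from.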

Since its initial design ITM defaulted to 16 communication services threads. At ITM 623 FP5 the default was set to 32. At ITM 630 GA, the default was set to 64. This new case was an ITM 623 FP5 case.

By reviewing this ProcessTable trace

error (unit:kdsstc1,Entry="ProcessTable" all er)

I could see that only 33 of the 200 UX vtable updates were being processed, and only 22 of the 80 DB vtable updates. That discovery proved to me that the vtable updates were seriously impacting TEMS processing. I imagined a parallel where 200 text messages arrive at a cellphone in a single second and how that might destabilize the cellphone software.
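To put those counts in proportion, here is a small Python sketch of the arithmetic. The numbers are the ones quoted above; the function name is just for illustration.

```python
# Counts reported in the text: (updates processed, updates expected).
OBSERVED = {"UX": (33, 200), "DB": (22, 80)}


def processed_percent(processed, expected):
    """Return the share of expected updates that were processed, in percent."""
    return 100.0 * processed / expected


if __name__ == "__main__":
    for agent_type, (processed, expected) in OBSERVED.items():
        pct = processed_percent(processed, expected)
        print(f"{agent_type}: {processed}/{expected} = {pct:.1f}% processed")
```

In other words, roughly one in six UX updates and a little over one in four DB updates were getting through.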

Recovery Action and Successful Result

SQL was manually prepared to delete all of the objects relating to these virtual table updates. The customer ran the SQLs using KfwSQLClient and then restarted all the hub and remote TEMS. There was no loss of function because some of the tables were unused and in any case the tables are partial and thus at best confusing.

The difference was astonishing. In the next 30 hours there was just a single missed heartbeat, and that one occurred during scheduled maintenance on a remote TEMS. The peak Listen Pipe count dropped from 21 to 2 [two!!], and most things were running smoothly. The hub TEMS CPU time rose slightly [0.92% to 1.07%], mainly because all the remote TEMSes were staying connected and sending work instead of spending significant time in Offline status.

Self Help Recovery Tool

The tool is based on the TEMS Agent Health survey and uses the same proven logic. The binary objects are here.

ITM Virtual Table Recovery Installation

The virtual table recovery package includes one Perl program that uses CPAN modules. The program has been tested in several environments. Windows had the most intense testing. It was also tested on zLinux. Many Perl 5 levels and CPAN package levels will be usable. Here are the details of the testing environments.

You might discover the need for other CPAN modules as the programs are run for the first time. The programs will likely work at other CPAN module levels but this is what was most recently tested.

The Windows Activestate Perl environment uses the Perl Package Manager to acquire the needed CPAN modules. The Agent Survey technote has an appendix showing usage of that manager program with screen captures.

Please note: In some environments it is a major problem to install the required up-to-date CPAN packages. Internet access may not be available, or Perl may be a shared resource which you do not have the right to change. Changing such packages could negatively affect other programs. To manage this case a zip file is included: the URL is here. See the History section following the summary if you need earlier levels. The zip file is useful for both Windows and Linux/Unix. For Windows the zip file contains a directory "inc" which contains the needed CPAN packages. For Linux/Unix the zip file contains a tar file unix-cpan-inc.tar. Transfer that to the Linux/Unix system and untar it like this:

tar -xf unix-cpan-inc.tar

That will create a directory "inc".

If you need this CPAN directory, add the following parameter to the program invocation "-Iinc" - which is dash then capital I followed by the directory name inc.

perl -Iinc sitvtbl.pl <rest of parms>

In that way the CPAN packages are used only for this one program.
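If the program is launched from a wrapper script, the -Iinc parameter can be added only when the bundled directory is actually present. The following Python sketch shows one way to build the command line; the function name and the wrapper itself are assumptions for illustration and are not part of the package.

```python
import os


def build_command(extra_args, inc_dir="inc"):
    """Build the argument list for running sitvtbl.pl.

    Adds -I<inc_dir> only when that directory exists, so the bundled CPAN
    modules are used for this one program and nothing else.
    """
    cmd = ["perl"]
    if os.path.isdir(inc_dir):
        cmd.append("-I" + inc_dir)
    cmd.append("sitvtbl.pl")
    cmd.extend(extra_args)
    return cmd
```

The resulting list can be passed to subprocess.run, for example subprocess.run(build_command(["-v"]), check=True).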

Package contents

The package contains the sitvtbl.pl program and a model sitvtbl.ini file.

To install the virtual table recovery package, unzip or untar the file contents into a convenient directory. On Linux/Unix you will need to use chmod [and chown/chgrp if the ownership must change] to make the program executable. The package also includes a model sitvtbl.ini file. The soap control is required [see later for discussion]. The userid and password may be supplied in the sitvtbl.ini. In this case the sitvtbl.ini file looks like this

soap <server_name>
user <user>
passwd <password>

The user and password credentials may be supplied from standard input. This increases security by ensuring that no user or password is kept in any permanent disk file. In this case the sitvtbl.ini file would look like this:

soap <server_name>
std
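Both ini layouts shown so far are simple "keyword value" lines. As a rough illustration of how such a file can be read, here is a Python sketch. It is not the parser used by sitvtbl.pl; the function name and the handling of blank lines are assumptions.

```python
def parse_ini(text):
    """Parse 'keyword value' lines as shown in the examples above.

    Returns a dict. 'soap' is collected into a list because it may be
    given more than once; 'std' is a flag with no value. Comment handling
    is not described in the text, so none is attempted here.
    """
    settings = {"soap": [], "std": False}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        key = parts[0]  # controls are lower case only, so no case folding
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "soap":
            settings["soap"].append(value)
        elif key == "std":
            settings["std"] = True
        else:
            settings[key] = value
    return settings
```

For the second layout, parse_ini("soap hub1\nstd\n") gives a single soap entry with std set to True, and no user or passwd keys.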

The std option can also be supplied on the command line -std. In either case, a program must supply the userid and password in this form

-user <userid> -passwd <password>

The program invocation would be something like this

mycreds | perl …
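The mycreds program above is whatever you choose to write. As a minimal sketch, the following Python script takes the credentials from two environment variables and writes them in the required form. The variable names ITM_USER and ITM_PASSWD are hypothetical; a production setup would more likely fetch the values from a password vault.

```python
import os
import sys


def credential_line(env):
    """Format credentials in the form the text gives for standard input."""
    user = env["ITM_USER"]      # hypothetical variable name
    passwd = env["ITM_PASSWD"]  # hypothetical variable name
    return "-user {} -passwd {}".format(user, passwd)


if __name__ == "__main__":
    try:
        print(credential_line(os.environ))
    except KeyError as missing:
        sys.exit("missing environment variable: {}".format(missing))
```

Saved as mycreds.py, it would take the place of mycreds in the pipe shown above.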

ITM Virtual Table Recovery Configuration and Usage

The Agent virtual table recovery package has controls to match installation requirements but the defaults work in most cases. Some controls are in the command line options and some are in the sitvtbl.ini file. Following is a full list of the controls.

The following table shows all options. All command line options except -h, -ini and the three debug controls can be entered in the ini file. The command line takes precedence if both are present. In the following table, n/a means the option is not recognized in that context. All controls are lower case only.

| command | ini file | default | notes |
|---|---|---|---|
| -lst | n/a | off | Use KfwSQLClient SQL dumps |
| -log | log | ./sitvtbl.log | Name of log file |
| -ini | n/a | ./sitvtbl.ini | Name of ini file |
| -debuglevel | n/a | 90 | Control message volume |
| -debug | n/a | off | Turn on some debug points |
| -dpr | n/a | off | Dump internal data arrays |
| -h | n/a | <null> | Help messages |
| -v | verbose | off | Messages on console also |
| -vt | traffic | off | Create traffic.txt [large] |
| n/a | soap_timeout | 180 | Wait for soap |
| n/a | soap | <required> | SOAP access information |
| -std | std | off | Userid/password in stdin |
| -user | user | <required> | Userid to access SOAP |
| -passwd | passwd | <null> | Password to access SOAP |
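The precedence rule [command line over ini file over default] amounts to a layered merge. The Python sketch below illustrates the idea using some of the defaults from the table. It is not taken from sitvtbl.pl, and the function and dictionary names are assumptions.

```python
# Defaults for some of the controls listed in the table above.
DEFAULTS = {
    "log": "./sitvtbl.log",
    "verbose": False,
    "traffic": False,
    "soap_timeout": 180,
    "std": False,
}


def effective_settings(ini_values, cli_values):
    """Combine defaults, ini file values and command line values.

    Later sources win, so the command line overrides the ini file,
    which in turn overrides the defaults.
    """
    merged = dict(DEFAULTS)
    merged.update(ini_values)
    merged.update(cli_values)
    return merged
```

For example, if the ini file sets log and soap_timeout and the command line sets only log, the command line log name is used together with the ini soap_timeout.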

Many of the command line entries and ini controls are self-explanatory. The following options can be set multiple times: -pc and -tems and -soap. All time-based settings are in seconds.

soap specifies how to access the SOAP process with the name or ip address of the server running the hub TEMS. See next section for a discussion.

soap_timeout controls how long the SOAP process will wait for a response. One of the agent failure modes is to not respond to real time data requests. This default is 180 seconds. It might need to be made longer in some complex environments. A value of 90 seconds resulted in a small number of failures [2 agents] in a test environment with 6000 agents.

ITM Virtual Table Recovery Package KfwSQLClient or lst Option

This method is easier to implement than the SOAP option because it does not need any CPAN modules. The work is done on the same system that has a TEPS connecting to the hub TEMS. See the following section "ITM Virtual Table Recovery Outputs" for how to use the results.

Linux/Unix Instructions

1) First place sitvtbl.pl in the directory where it will be run, like /opt/IBM/ITM/tmp

2) Run this command to create a SELECT SQL file named cnodl.sql.
perl sitvtbl.pl -gensql

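3) Run KfwSQLClient with cnodl.sql as input and redirect the output to QA1CNODL.DB.LST, in the same way as step 3 of the Windows instructions below, using the KfwSQLClient location for your installation.

4) Run this command
perl sitvtbl.pl -lst

5) That creates the show.sql and delete.sql which are the end result.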

Windows Instructions

1) First place sitvtbl.pl in the directory where it will be run, like c:\ibm\itm\tmp [creating if needed]

2) Run this command to create a SELECT SQL file named cnodl.sql.
perl sitvtbl.pl -gensql

3) Run this command

\ibm\itm\cnps\KfwSQLClient /v /f cnodl.sql >QA1CNODL.DB.LST

4) Run this command
perl sitvtbl.pl -lst

5) That creates the show.sql and delete.sql which are the end result.
=============================================

The /v option produces progress and result messages.

You can run the KfwSQLClient command on the TEPS system and then copy the .LST file to a system where Perl is installed.
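If this procedure is run regularly, the three Windows commands can be driven from a small script. Here is a Python sketch under these assumptions: perl is on the PATH, the script runs in the directory holding sitvtbl.pl, and KfwSQLClient is at the path used in the Windows instructions. The function names are illustrative only.

```python
import subprocess

# Path and file name as given in the Windows instructions above.
KFWSQLCLIENT = r"\ibm\itm\cnps\KfwSQLClient"
LST_FILE = "QA1CNODL.DB.LST"


def lst_steps():
    """Return the three commands as (argument list, stdout file or None)."""
    return [
        (["perl", "sitvtbl.pl", "-gensql"], None),
        ([KFWSQLCLIENT, "/v", "/f", "cnodl.sql"], LST_FILE),
        (["perl", "sitvtbl.pl", "-lst"], None),
    ]


def run_steps(steps, runner=subprocess.run):
    """Run each step in order; check=True stops at the first failure."""
    for args, stdout_name in steps:
        if stdout_name is None:
            runner(args, check=True)
        else:
            # Equivalent of the ">QA1CNODL.DB.LST" redirection.
            with open(stdout_name, "w") as out:
                runner(args, check=True, stdout=out)
```

Calling run_steps(lst_steps()) performs steps 2 to 4. The runner parameter exists only so the sequence can be checked without running the real commands.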

ITM Virtual Table Recovery Package soap control

The soap control specifies how to access the SOAP process. For a simple ITM installation using default communication controls, specify the name or ip address of the server running the hub TEMS. If you know the primary hub TEMS a single soap control is least expensive.

If the ITM installation is configured with hot standby or FTO there are two hub TEMS. At any one time one TEMS will have the primary role and the other TEMS will have the backup role. If the TEMS maintenance level is ITM 622 or later, set two soap controls which specify the name or ip address of each hub TEMS server. The TEMS with the primary role will be determined dynamically.

Before ITM 622 you should determine ahead of time which TEMS is running as the primary and set the single soap control appropriately.

Connection processing follows the tacmd login logic. It will first use https protocol on port 3661 and then use http protocol on 1920. If the SOAP server is not present on that ITM process, a virtual index.xml file is retrieved and the port that SOAP is actually using is retrieved and used if it exists.

Various failure cases can occur.

1) The target name or IP address may be incorrect.

2) Communication outages can block access to the servers.

3) The TEMS task may not be running and there is no SOAP process.

4) The TEMS may be a remote TEMS which does not run the SOAP process.

5) The SOAP process may use an alternate port and firewall rules block access.

The recovery actions for the various errors are pretty clear. If (5) is in effect, consider running the virtual table recovery package on a server which is not affected by firewall rules. Alternatively, always make sure that the hub TEMS is the first process started. If it must be recycled, then stop all other ITM processes first and restart them after the TEMS recycle. See this blog post which shows how to configure a stable SOAP port at the hub TEMS.

If the protocol is specified in the soap control only that protocol will be tried.

soap https://<servername>

When the port number is specified in the soap control, 3661 will force https protocol and 1920 will force http protocol.

soap <servername>:1920

The ITM environment can be configured to use alternate internal web server access ports using the HTTP and HTTPS protocol modifiers. For this case you can specify the ports to be used

soap https://<servername>:4661

or if both have been altered

soap https://<servername>:4661

soap http://<servername>:2920
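The protocol and port rules described above can be summarized in a short function. The following Python sketch is an interpretation of the text, not code from sitvtbl.pl. The function name is hypothetical, and the last branch [a non-default port with no protocol] is not covered by the text, so the behavior there is an assumption.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"https": 3661, "http": 1920}
PORT_TO_PROTOCOL = {port: proto for proto, port in DEFAULT_PORTS.items()}


def candidate_endpoints(soap_value):
    """Return the (protocol, host, port) attempts for one soap control."""
    has_protocol = "://" in soap_value
    # urlsplit only recognises the host when a '//' prefix is present.
    parts = urlsplit(soap_value if has_protocol else "//" + soap_value)
    host, port = parts.hostname, parts.port
    if has_protocol:
        # An explicit protocol means only that protocol is tried.
        proto = parts.scheme
        return [(proto, host, port or DEFAULT_PORTS[proto])]
    if port is None:
        # Default order: https on 3661 first, then http on 1920.
        return [("https", host, 3661), ("http", host, 1920)]
    if port in PORT_TO_PROTOCOL:
        # 3661 forces https and 1920 forces http.
        return [(PORT_TO_PROTOCOL[port], host, port)]
    # Not covered by the text: assume both protocols are tried on that port.
    return [("https", host, port), ("http", host, port)]
```

With two soap controls, as in the last example, each control yields exactly one attempt because both the protocol and the port are given.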

The logic generally follows tacmd login processing. There are two differences: IPv6 is not supported and port following in the ITM 6.1 style is not included. SOAP::Lite does not support IPv6 at present. ITM 6.1 logic could be added but is relatively rare and was not available for testing.

ITM Virtual Table Recovery Outputs

Two SQL files are produced: show.sql and delete.sql. The easiest way to run them is documented in Do It Yourself TEMS Table Display. Use the KfwSQLClient method, which is the second one listed.

First run the show.sql and determine if the problem objects are present. If not you can stop here. Some may be missing because the SELECTS look at all possible problem objects and you may not have all such agents installed.

Second run the delete.sql. In general there will be no error messages.

Third run the show.sql again. This should return no data.
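The three steps form a simple check, act, verify sequence. The Python sketch below expresses that logic. The run_sql_file argument is a hypothetical callable that runs one SQL file [for example through KfwSQLClient] and returns the rows it produced; how you implement it depends on your environment.

```python
def cleanup(run_sql_file):
    """Run show, delete, show and report what happened.

    `run_sql_file` is a hypothetical callable that runs one SQL file
    and returns the rows it produced.
    """
    if not run_sql_file("show.sql"):
        # No problem objects present, so there is nothing to do.
        return "nothing to delete"
    run_sql_file("delete.sql")
    remaining = run_sql_file("show.sql")
    if remaining:
        return "{} objects still present, repeat later".format(len(remaining))
    return "objects deleted, recycle the TEMSes"
```

Running the function a second time after a successful cleanup simply reports that there is nothing to delete.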

If some remote TEMSes are offline, this may need to be repeated. There is no harm in deleting objects that do not exist.

If a new remote TEMS is created, rerun this process.

If TEMS maintenance is performed, rerun this process.

When the objects are deleted, recycle the hub TEMS [and the backup hub TEMS if FTO is used] and all the remote TEMSes. This does not need to be done all at once, but the benefit will be gradually seen as each one is recycled.

In the long term, the agents involved will have new application support files which will not have these virtual table update objects.

Summary

Feedback Wanted!!

Please report back your experience and suggestions. If the virtual table recovery package does not work well in your environment, repeat the test with “-debuglevel 300” added and then send the sitvtbl.log [compressed] for analysis.

History and Earlier versions

If the current version of the virtual table recovery tool does not work, you can try older published binary object zip files. At the same time please contact me to resolve the issues. If you discover an issue, test intermediate levels to isolate where the problem was introduced.