Sitworld: TEMS Database Repair

Draft #2 – 24 November 2016 - Level 1.01000

Introduction

The TEMS database tables are used to store user data such as situation descriptions and distribution definitions. They also keep running data such as current situation status on agents. There are many more internal and functional tables.

When the files holding the data are damaged, the TEMS usually malfunctions. Over the years there have been many reasons for such damage. Here are some examples:

TEMS exception and failure.

File system full.

Unwise manual changes or restoring from a backup that wasn't taken correctly.

Power outage without any UPS backup.

SAN [Storage Area Network] device failure.

System shutdown without stopping the TEMS.

Many unexplained instances.

Hub TEMS Recovery Attempt WARNING!!!

A primary hub TEMS is the repository of fundamental user data, and any recovery of that data is a delicate operation which can easily result in a reinstall and significant downtime. Please work with ITM support in planning a hub TEMS data recovery. A remote TEMS can be recovered quite simply, as can an FTO backup hub TEMS.

In addition you should have a backup/recovery plan for hub TEMS data. See this document for five different ways to accomplish this goal. A simple backup of the files while the TEMS is running is inadequate and can lead to significant downtime. These are hot database files; many change constantly and they are tightly connected.

Non-Hub TEMS Recovery

The process is very simple, although it varies by platform [hardware and operating system] and by TEMS maintenance level. From a high level view you stop the TEMS [if running], replace the database files with emptytable files, and then start the TEMS and let the hub TEMS refill it with correct data naturally. A reference to the files follows. They are not exactly empty: at the very least they contain an “end of objects” record, and some are pre-loaded with data. The ones here were accumulated from install media builds for ITM 620, ITM 621, ITM 622, ITM 623 and ITM 630. They are the exact files you would lay down during a new TEMS install.

There are three types of files:

Bigendian – for Unix [AIX/Solaris/HPUX] and Linux on Z

Littleendian – for Linux/Intel and Windows

VSAM – z/OS key-sequenced [indexed] datasets

The references here are to a zip file for each maintenance level. Each zip file contains a bigendian.tar file, a littleendian.tar file and a Windows zip file. The last two contain identical files but are packaged differently for convenience. With z/OS the story is quite different, see later.

Windows Recovery for non-hub TEMS

Select the correct maintenance level and load the proper zip file from the links above. Unzip that file and you will use the .zip file included.

Unzip that file into some convenient directory – we will assume C:\TEMP but it can be anyplace. You will see a lot of QA1*.DB files and QA1*.IDX files.

Stop the TEMS

Copy the files, for example [adjust for actual install directory]

cd c:\IBM\ITM\cms
copy c:\temp\QA1*.*

You could also use Windows explorer. You may also wish to make a safety copy of those files.

Start the TEMS

Monitor for correct operation.

Recovery complete

Linux/Unix Recovery for non-hub TEMS

Select the correct maintenance level and load the proper zip file from the links below. Most environments will have a gunzip command. If not you can unzip on some convenient Windows workstation.

Select the proper endian type. Bigendian is for all Unix and Linux on z systems. Littleendian is for all Linux/Intel systems. For this example we use Linux at ITM 630; the file is ITM630_emptytables.littleendian.tar and it is assumed to be copied to /opt/IBM/ITM/tmp.

Move that littleendian file to the system where the TEMS runs and un-tar it.
cd /opt/IBM/ITM/tmp
tar -xf ITM630_emptytables.littleendian.tar
This will create many QA1* files

At this point you have to determine the attributes/owner/group of the current TEMS files. You could do that with a command such as
ls -l /opt/IBM/ITM/tables/<temsnodeid>/QA1CSTSH.DB
which in my zLinux test environment looks like this:
nmp180:~ # ls -l /opt/IBM/ITM/tables/HUB_NMP180/QA1CSTSH.DB
-rwxr-xr-x 1 root root 35274789 Nov 14 21:03 … QA1CSTSH.DB
[Above line shortened for display purposes.]

Next change the un-tar’d files to match what is currently being used and what the TEMS expects. Remember, the following is just an example from my environment; run the commands appropriate to your actual environment.
cd /opt/IBM/ITM/tmp
chmod 755 QA1*.*
chown root QA1*.*
chgrp root QA1*.*

Next stop the TEMS

Next copy the emptytable files into the directory where the TEMS expects them
cd /opt/IBM/ITM/tables/<temsnodeid>
cp /opt/IBM/ITM/tmp/QA1*.* .

Note the trailing period which means to copy to the current directory.

Next start the TEMS

Monitor for normal operations

End of recovery
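The steps above can be sketched as one script. This is a hedged illustration only: it simulates the copy inside a temporary directory so it can be tried safely, the itmcmd stop/start lines are commented out, and in real use the paths would be under /opt/IBM/ITM with your actual TEMS node id.

```shell
# Simulated non-hub TEMS recovery (illustration - real paths are
# /opt/IBM/ITM/... and the TEMS must actually be stopped and restarted).
ITMHOME=$(mktemp -d)
mkdir -p "$ITMHOME/tmp" "$ITMHOME/tables/REMOTE_NODE"
echo "damaged" > "$ITMHOME/tables/REMOTE_NODE/QA1CSTSH.DB"  # stand-in damaged table
echo "empty"   > "$ITMHOME/tmp/QA1CSTSH.DB"                 # stand-in emptytable file
# ./itmcmd server stop <temsname>       # 1) stop the TEMS first
cp "$ITMHOME"/tmp/QA1*.* "$ITMHOME/tables/REMOTE_NODE/"     # 2) lay down the emptytables
# ./itmcmd server start <temsname>      # 3) start the TEMS and monitor
cat "$ITMHOME/tables/REMOTE_NODE/QA1CSTSH.DB"
```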

z/OS recovery for non-hub TEMS

Please note: this is hardly ever needed. The last PMR I worked on *looked* like it was needed but the symptom was actually a harmless TEMS message [actually a defect] that complained about a table… and there was no actual problem at all! So I expect it is very rare to have to do this procedure.

Always involve IBM Support if you have any uncertainty at all in this process. Also, if you *think* you know more about z/OS than the author – you are very likely correct!!

z/OS recovery with ICAT configuration

The following uses QA1CSTSH as an example.

1) Stop the TEMS task

2) Delete or rename the QA1CSTSH VSAM dataset. If unsure, examine the Joblog output to determine the complete dataset name.

3) Proceed to ICAT and navigate to the 'Runtime Environments' panel (KCIPRTE)

4) Place a 'B' next to the RTE [Run Time Environment] that contains the TEMS that owns the file you wish to recreate.

5) That will generate the DS#1xxxx job which should then be submitted.

6) The job will detect the file that is missing and recreate ONLY that file.

7) The job should complete with condition code zero

8) The TEMS can then be started.

z/OS recovery with PARMGEN configuration

The general idea is the same as ICAT.

For steps #3 - #7, it can be replaced with similar instructions here. That documents how to reallocate PDS files but the path followed is the same. Following are some notes from the PARMGEN expert.

The job would vary - you can use KDSDELJB as a model job that has the deletes, but make it specific to the RKDSSTSH VSAM file

(//QA1CSTSH DD DISP=SHR,DSN=&RVHILEV..&SYS..RKDSSTSH.)

Submit the composite KCIJPALO job the same as in the doc, and for the standalone job, refer to the PARMGEN KDSDVSRF job - it needs to be modified of course.

Hub TEMS – if you absolutely have no choice

There are many TEMS hub database tables which you can reset only by losing significant data and undergoing a long manual reinstall and rebuild. This could mean a week or more of outage. It is very important to involve IBM Support if you have any doubts at all.

However there are a few tables which can be reset with no real impact. These 3 tables contain internal processing data, not user data.

TSITSTSC – QA1CSTSC: The Situation Status Cache which is reused every time the TEMS starts.

TSITSTSH – QA1CSTSH: The Situation Status History. This is an intermediate file where situation event status collect. It is a wraparound table and defaults to 8192 rows. At hub TEMS startup all the remote TEMSes and agents [if directly connected] send current status. Therefore you only miss situation status history after a reset. Since there are no ITM functions which display or use the history, nothing much is lost by resetting it to emptytable status.

[several tables] – QA1CDSCA: This is the combined catalog table. If this is reset to emptytable status, at TEMS startup the pre-defined data is updated based on the existing package [like klz.cat] files. Therefore it can be reset to emptytable status and nothing is lost. As a minor point, the TEMS has a hard limit of 512 packages; at 513 the TEMS will crash and not come up. It is pretty rare but definitely something to keep aware of.

In each case you would do the same as a complete replacement but only handle the QA1*.DB and QA1*.IDX files for the tables involved.

Backup/Recovery best practice

The following document was co-authored with L3 TEMS support and represents the best current thinking. It gives five ways to create a valid, useful and reliable backup of the TEMS database files.

Sitworld: The Encyclopedia of ITM Tracing and Trace Related Controls

Draft #2 – 23 September 2016 - Level 1.01000

Introduction

ITM tracing is at once the most tiresome of topics and sometimes the most important. This post gathers everything I have collected and discovered over the years. Expect to see many additions and corrections over time.

Chapter 1 is dedicated to controlling diagnostic log file sizes.

Chapter 2 is dedicated to describing operation log files - what they are and where they are found.

Further chapters are being prepared. There is also some testing to be performed and some of the model control files might change.

Chapter I - Control of diagnostic log file size and location.

ITM diagnostic log file size and location control differs by platform [Linux/Unix, Windows, z/OS and i/5]. The diagnostic log contains detailed process information. By default – when the control is set to ERROR – you see error messages and any information level messages. The error messages do not always mean an actual error condition. The entire goal is to help IBM Support understand product problems. This chapter shows how to control the size and location.

All of the examples assume the default install location. In practical usage you will specify the directories actually chosen for the particular installation.

Linux/Unix

Following is the best practice from ITM 623 GA onward. Earlier best practice is described at the end.

In the /opt/IBM/ITM/config directory create a file xx.environment which has the same attributes/owner/group as xx.ini. For example, lz.environment and lz.ini or ms.environment and ms.ini. If one already exists, just use it. Here are example commands to create a new such environment file.
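For example, assuming the Linux OS Agent [product code lz] and a GNU chmod/chown with the --reference option, something like the following would work. The commands are shown against a temporary directory so they can be tried safely; substitute /opt/IBM/ITM/config in real use.

```shell
# Create lz.environment with the same permissions and owner as lz.ini.
# A temp dir stands in for /opt/IBM/ITM/config; "lz" is just an example
# product code. GNU chmod/chown --reference assumed.
cfg=$(mktemp -d)                      # stand-in for /opt/IBM/ITM/config
touch "$cfg/lz.ini"; chmod 644 "$cfg/lz.ini"
touch "$cfg/lz.environment"
chmod --reference="$cfg/lz.ini" "$cfg/lz.environment"
chown --reference="$cfg/lz.ini" "$cfg/lz.environment"
ls -l "$cfg"
```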

These environment variables are applied just before starting the process and take effect without any reconfiguration. A recycle is needed of course. When testing is over they can be commented out, deleted, or the whole file deleted.

Other environment variables can be specified, but we are working here on just KBB_RAS1_LOG where characteristics like diagnostic log segment sizes are defined. If this is absent, ITM has some built in defaults.

Following are the environment variable lines to enable KBB_RAS1_LOG. Further explanations follow. The lines are separated by blank lines here for clarity, but blank lines are not required.

LIMIT – size in megabytes of diagnostic log segments. I once had to use a LIMIT of 200 megabytes and a COUNT of 16 to capture 24 hours of logs.

PRESERVE=1 – makes sure the first segment is preserved

MAXFILES – total number of files to be kept. This *must* be larger than COUNT

%(syspgm) – will be replaced by the main program name, like klzagent or kdsmain

%(sysutcstart) – will be replaced with the start epoch UTC time in seconds

The maximum space used will be MAXFILES*LIMIT in megabytes.
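Putting those controls together, an entry in the xx.environment file might look like the following sketch. The path template is an assumption modeled on the i/5 example later in this post; the COUNT and LIMIT values are the ones mentioned above, and MAXFILES is set larger than COUNT as required.

```
# Hypothetical KBB_RAS1_LOG override in /opt/IBM/ITM/config/lz.environment
KBB_RAS1_LOG=(/opt/IBM/ITM/logs/%(syspgm)_%(sysutcstart)-.log
  INVENTORY=/opt/IBM/ITM/logs/%(syspgm).inv
  COUNT=16 LIMIT=200 PRESERVE=1 MAXFILES=32)
```

Remember this is specified as one long line in the actual file; it is wrapped here only for readability.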

Before ITM 623 GA, best practice was as follows:

Create a file xx.override in /opt/IBM/ITM/config with the same attributes/owner/group as xx.ini.

Put within that file the change you want to introduce. Compared to modern setups the values need to be specified within single quotes such as
KBB_VARPREFIX='%'

Instead of calculating RUNNINGHOSTNAME and PRODUCT on the fly, figure out what they should be and add them to the KBB_RAS1_LOG definition.

In the xx.ini file add the following line

. /opt/IBM/ITM/config/xx.override

This is known as a source include definition.

For many agents this is sufficient. The TEMS is an instanced agent and more work is needed.

The first method is to run ./itmcmd config -S -t <temsname> and accept all defaults.

The second method is to add the source include line at the end of the hostname_ms_temsname.config file.

Windows

Windows diagnostic log segments have the same form as Linux/Unix.

Tracing is defined via the Manage Tivoli Enterprise Monitoring Services GUI.

Right click on agent [such as TEMS or Windows OS Agent]

Select Advanced

Select Edit Trace Parms…

There are entry areas for

Maximum Log Size Per File [LIMIT]

Maximum Number of Log Files Per Session [COUNT]

Maximum Number of Log Files Total [MAXFILES]

As usual a recycle is needed to implement.

z/OS

The diagnostics log information is included in the RKLVLOG SYSOUT file.

When intensive tracing is configured this can grow to a large size. If that happens, the following command will close the current SYSOUT file and start another. In that way the current log can be captured to a disk file. That is often performed using SDSF after the switch, to “print” the log to a disk file.

/f cmstask,TLVLOG SWITCH

The CLASS option can be used to specify the output class of the new sysout file.

/f cmstask,TLVLOG SWITCH CLASS=W

z/OS logs are not configurable with KBB_RAS1_LOG. Configuration variables are kept in the RKANPARU PDS member KxxENV.

i/5

The i/5 platform [previously AS/400] uses the following form for KBB_RAS1_LOG. This environment variable is placed in QAUTOTMP/KMSPARM file member KBBENV.

KBB_RAS1_LOG=

(QAUTOTMP/KA4AGENT01 QAUTOTMP/KA4AGENT02 QAUTOTMP/KA4AGENT03)

INVENTORY=QAUTOTMP/KA4RAS.INV

COUNT=3

LIMIT=5

PRESERVE=1

MAXFILES=20

KBB_RAS1_LOG is specified as one long line although it is presented here on separate lines for clarity.

The meaning of the controls is identical to Linux/Unix.

Chapter 2 Operations log

The operations log is a high level view of ITM operations. It bears a close relationship to the ITM Universal Message Console. Universal Messages are written into a wraparound in-storage table and are also written to the disk log for later analysis. The interesting benefit here is that you can write situations against the Universal Message Console. Each ITM process has a universal message console; if you want more details see this document:

Linux/Unix ITM Operations Log

The TEMS operations log is contained in the /opt/IBM/ITM/logs directory and has the name format

<hostname>_ms_<epoch>.log

Where

<hostname> names the server the TEMS is running on

<epoch> is a decimal number corresponding to the POSIX epoch - the number of seconds since 1 January 1970 [not counting leap seconds] at the time the log started. Diagnostic logs use a hex representation of the epoch.
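As a quick illustration, the epoch in a log name can be turned back into a readable time. The log name below is made up, and GNU date is assumed for the -d "@epoch" form; on other Unixes a small perl one-liner would do the same.

```shell
# Recover the start time from an operations log name (hypothetical name).
log=myhost_ms_1479158580.log
epoch=${log#*_ms_}           # strip "<hostname>_ms_"
epoch=${epoch%.log}          # strip ".log"
date -u -d "@$epoch"         # readable UTC start time (GNU date)
hex=$(printf '%x' "$epoch")  # the hex form used in diagnostic log names
echo "$hex"
```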

Agent operations logs on Linux/Unix are contained in the same /opt/IBM/ITM/logs directory and have the name

<agent_name>.LG0

Where

<agent_name> is the managed system name [or Agent Name]

During each startup, the current <agent_name>.LG0 is copied to <agent_name>.LG1.

The Warehouse Proxy Agent operations log is not kept as a disk file. Those logs are preserved in a data warehouse table.

Windows ITM Operations Log

The Windows TEMS operations log is in the C:\IBM\ITM\cms directory and has the name kdsmain.msg.

The Windows Agent operations log has the same name format as Linux/Unix but any colons [:] in the agent name are converted to underlines. That is required because the Windows operating system does not permit colons in file names. The location depends on many factors including 32 versus 64 bit agent instance.
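The colon-to-underscore mapping can be sketched like this; the managed system name is a hypothetical example.

```shell
# Colons are not legal in Windows file names, so each one becomes an underscore.
msn='Primary:MYHOST:NT'                          # hypothetical managed system name
logname="$(printf '%s' "$msn" | tr ':' '_').LG0" # resulting operations log name
echo "$logname"
```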

z/OS TEMS and Agent operations log

There is no separate log for z/OS operations. Instead it is inter-mixed with diagnostic log lines in the RKLVLOG SYSOUT file.

Sitworld: ITM2SQL Database Utility

Version 1.37000 11 November 2016

Inspiration

I have long envied the TEPS facility called migrate-export, which takes the TEPS database and creates a file of SQL commands to recreate the database.

Recently I found a way to accomplish this on a live ITM system using the KfwSQLClient utility. It does not have an obvious use case that solves a specific problem, but it can be useful for viewing TEMS database contents and producing data for ad-hoc reports.

The result is not suitable for a backup - see Best Practice TEMS Database Backup and Recovery - because tables are often jointly updated. A capture of one table and a capture of a second table a few seconds later can miss the combined update. I have even seen cases where a capture of a single table gets inconsistent results: that was getting the node status table from a remote TEMS while a terrific number of duplicate agent name cases were constantly changing the table.

More than anything, I had wanted to do it for a long time and after years figured out a way to accomplish the goal. If nothing else there is a pleasure in satisfying that desire and sharing the results.

Overview

The itm2sql.pl utility uses the TEMS catalog file kib.cat and the running system to capture a TEMS table's current contents and produce a file of

1) INSERT SQL statements
2) Tab Separated Variables suitable for a spreadsheet program
3) Text file with fixed length columns for easy reference, sorting and searching
4) An index only file which is useful when comparing two tables for differences

ITM2SQL Package Installation

I suggest itm2sql.pl be placed in some convenient directory, perhaps an itm2sql directory under the installation tmp directory. That is what the examples will assume. For Windows you need to create the <installdir>\tmp directory. You can of course use any convenient directory.

Linux/Unix: /opt/IBM/ITM/tmp

Windows: c:\IBM\ITM\tmp

Linux and Unix almost always come with Perl installed. For Windows you can install a no cost Community version from www.activestate.com if needed.

The hub TEPS should be connected to the current primary hub TEMS. That can be running on any platform: Linux/Unix/Windows/zOS.

ITM2SQL and Catalog files

The itm2sql.pl processing requires a matching catalog file. This will be found in a TEMS install at

Linux/Unix: <installdir>/tables/<temsnodeid>/RKDSCATL

Windows: <installdir>\cms\RKDSCATL

z/OS: RKANDATV

For Linux/Unix/Windows the usual catalog name will be kib.cat. For z/OS the member name will be KIBCAT.

Copy that file to the directory where itm2sql.pl will be used - for z/OS the name will have to be changed of course. If the TEPS is running on the same system as the TEMS, you can just supply the fully qualified name and not use a copy.

The kib.cat gives you references to most of the tables that the TEMS uses. There are other TEMS components - e.g. remote deploy - which use other catalogs. Remote deploy uses kdy.cat. This document does not mention such other catalogs further.

ITM2SQL Usage

The ITM2SQL package has parameters to match installation requirements. The following table shows all options; extensive notes follow the table.

command          default                   notes
-d               off                       produce debug messages on STDERR
-help            off                       Produce help summary
-home            default install location  Directory for TEPS install
-ix              off                       Produce a show keys only output
-l               off                       INSERT SQL output with prefix count/keys
-o [file]        STDOUT                    Output to a named file or by report type
-qib             off                       Do not ignore the columns starting QIB
-s key           off                       Name a key, can have more than one
-si file         off                       Process only named keys in index file
-sx file         off                       Exclude named keys in index file
-testf file      off                       Process a previously captured listing file
-txt             off                       Fixed column text file
-tc                                        columns to process with -txt
-f                                         Favorite columns to process with -txt
-tlim            256                       maximum column bytes to display, 0=all
-tr              off                       translate tab/carriage return/line feed to blank
-v               off                       Tab Separated Variable
-work directory  TMP or TEMP               where to create work files

Following the option parameters are two positional arguments. The first is the catalog file - often kib.cat. The second is the name of the table to be processed.
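For example, a hypothetical invocation to produce a fixed-column text report of the situation description table might look like this; the utility and a copy of the catalog file must already be in the current directory.

```
perl itm2sql.pl -txt -o kib.cat TSITDESC
```

With -o followed by another option instead of a name, this would write to a file named after the table, such as TSITDESC.TXT.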

Notes

1) -home: if unspecified, the environment variable [Windows CANDLE_HOME] or [Linux/Unix CANDLEHOME] is used. If those are absent, the default install locations [Windows C:\IBM\ITM] or [Linux/Unix /opt/IBM/ITM] are used.

2) -ix is used to create a show keys only output. You must specify at least one key using -s and the combination of keys must make the reference unique. The resulting file can be used in -si or -sx to include or exclude those keys. This is extremely useful when comparing a capture at one time with a later time - or if comparing one hub TEMS with another hub TEMS.
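The compare workflow can be illustrated with standard tools. The key names below are fabricated stand-ins for real -ix output files; in practice each file would come from an itm2sql.pl -ix run against one hub.

```shell
# Compare two -ix captures to find keys unique to each hub.
cd "$(mktemp -d)"
printf 'SIT_A\nSIT_B\nSIT_C\n' > hubA.IX   # fabricated capture from hub A
printf 'SIT_B\nSIT_C\nSIT_D\n' > hubB.IX   # fabricated capture from hub B
comm -23 hubA.IX hubB.IX    # keys only in hubA (inputs must be sorted)
comm -13 hubA.IX hubB.IX    # keys only in hubB
```

Either resulting list can then be fed back with -si or -sx to report on just those objects.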

3) -ix, -txt, -v and -l are mutually exclusive. If all are missing, the INSERT SQL output format is produced.

4) -o with no output file [and followed by another - option] will pick a name based on the table name, a period, and a report type suffix [txt=TXT, v=TSV, ix=IX, l=LST, default=SQL]. -o with a following name will use that as the output file. If -o is absent, results are printed to standard output.

5) -qib will include columns beginning with QIB, which are not represented in the disk files and thus relatively uninteresting.

6) -s key - internally these are known as "show" keys because they will be presented at the beginning of the -l output type. In combination they should uniquely identify the object.

7) -txt - output report in a fixed column width presentation. The width of a column also depends on the length of the column name, and a blank is left between columns. This can be useful to feed into your own ad hoc reports.

8) -tc column - list of the columns to display. You can specify the option more than once, or give multiple columns separated by commas: -tc col1,col2,col3

11) -tr - some columns contain spacing controls like tab, carriage return or line feed. These can make the -txt output look strange. With -tr they are replaced by blanks.

12) -v - produce .TSV [tab separated variable] output format. This can be opened with a spreadsheet program.

13) -work - specify a work directory for temporary files. If -work is not specified, the environment variables [Windows TEMP] or [Linux/Unix TMP] are used. If those are absent, [Windows C:\TEMP] or [Linux/Unix /tmp] is used; failing that, the current directory is used.
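My reading of that resolution order, as a simplified sketch. OPT_WORK is a hypothetical stand-in for the -work option value, and the fixed C:\TEMP tier is skipped here; this is not the utility's actual code.

```shell
# Hypothetical sketch of the -work resolution order described in note 13:
# option value, else TEMP, else TMP, else the current directory.
unset OPT_WORK TEMP TMP
workdir="${OPT_WORK:-${TEMP:-${TMP:-$PWD}}}"
echo "$workdir"
```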

TEMS tables

Here is a list of TEMS tables of some interest. There are over 50 such tables but the following are the ones I find most useful. The columns are not listed here; you could run an SQL type report and get a list of the table columns.

CCT – Portal Client Take Action
EVNTMAP – Event Mapping
EVNTSERVER – Event Server
INODESTS – In core node status
ISITSTSH – In core Situation Status History
TACTYPCY – Workflow Policy Activities
TAPPLPROPS – SDA Support - 623 FP1
TCALENDAR – Calendar definitions
TGROUP – Group
TGROUPI – Group Entries
TNAME – Long Situation Name and index
TNODELST – Node List - online and MSLs
TOBJACCL – Object Access - distribution
TOVERITEM – Override Items
TOVERRIDE – Override definitions
TPCYDESC – Workflow Policy Description
TSITDESC – Situation Description
TSITSTSC – Situation Status Cache
TSITSTSH – Situation Status History

Example usage

To make maximum use of this utility, you would need to know the TEMS schema and logic. Since that is not published, the process will be more experiment and discovery. Some tables need different catalog files, such as kdy.cat for tables connected with remote deploy.

Sitworld: Policing the Hatfields and the McCoys

Draft #1 – 2 May 2016 - Level 0.5000

Inspiration

One more time I had to explain to a customer that you cannot have a situation formula that includes more than a single multi-row attribute group. They had a worthy goal: they wanted to test for a missing process - but only if that process was installed on the system being monitored. Process attribute groups are multi-row and file information attribute groups are multi-row, so this is an illegal formula. The Portal Client Situation editor would have foiled them, since after the first multi-row attribute group is selected, only single row attribute groups are offered when adding the next attribute test. However, like many customers, they used tacmd editSit to update the formula to what they wanted. I have seen this a couple of times a year "forever".

I was involved because that "monster" ITM situation flooded a remote TEMS with results and did not even achieve the desired effect. The Missing Process situation fired even though the software was not installed. In any event, the remote TEMS overload was so severe that the remote TEMS failed after a few hours. The Situation Audit static analysis tool pointed to the issue and the TEMS Audit tool reported on the massive workload caused by the errant situation. The remote TEMS overload would have been an amazing 100 times more severe except that 100+ such situations had a syntax error which prevented them from running. That is all too common when manually creating situation formula. [One review showed 30% of ITM environments having at least one situation with a syntax error.]

On the other hand, the need was real and had been available in previous monitoring solutions. Two multi-row attribute groups are like two feuding clans - like the legendary Hatfields and McCoys. They just don't get on at all and there is a lot of collateral damage.

Background

ITM situations are represented by SQL. To make this more concrete here is a simple situation formula for an Agent Builder Agent

It is a fact of ITM life that the SQL for a situation will only have a single table [equivalent to an attribute group at this level]. The TEMA or Agent Support library only handles a single table.

If multiple attribute groups were available, logic would have to be prepared to define a key to connect the two attribute groups something like this

WHERE ... K08K08FIL0.ATTRIBUT13 = K09K09MEM0.ATTRIBUT9 ..

However ITM has no place to make that definition and no logic to process it correctly if it was present. This is a clear product limitation no matter which way you look at it.

TEMS does handle the case of a single multi-row attribute group and a single row attribute group. It creates one or more invisible sub-situations and knits the results together. It does not have the logic at the TEMS to manage the two multi-row attribute group case.

There is a Light Over Here!

Given the extreme customer need, I searched for alternatives and found a way forward in the world of Mathematical Logic and Set Theory. A long time ago I was a math wonk in graduate school and still retain some of the training.

The goal is to calculate a useful result

A and B

for two multi-row attributes even though ITM does not support that.

ITM does have this construction

A *UNTIL/*SIT B

which you specify using the UNTIL tab in the Situation Editor. The logic is that if B is true [on the same managed system or Agent as A] then any situation event for A is closed and any future Situation Result for A is ignored. In set theoretic terms that is

A and (~)B

or A and not B. You can easily validate that by running through some examples on paper.

The first breakthrough idea is that A and B can use different attribute groups in the Base and Until situations. Situation B cannot usually have DisplayItem set, but A can use DisplayItem, and there is considerable value in that mixture.

A second set theory logic rule can now be employed

B is the same as (~)(~)B

Most people have heard it explained that a double negative is the same as a positive. That is one example.

Suppose we were looking at integers from 1 to 20. And then suppose that B had the formula that value > 10.

After B the integers in the result set would be 11,12,13,14,15,16,17,18,19,20.

In this case (~)B would be the test that value <= 10, and the results would be 1,2,3,4,5,6,7,8,9,10.

Now reversing again, (~)(~)B would again be the test that value > 10 and the results would be 11,12,13,14,15,16,17,18,19,20.

So B and (~)(~)B have exactly the same results.
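The little integer argument can be checked mechanically. Here is a throwaway sketch using awk over the numbers 1 to 20, mirroring the text's example:

```shell
# B:   value > 10          -> 11..20
# ~B:  not(value > 10)     -> 1..10
# ~~B: not(not(value>10))  -> 11..20 again
B=$(seq 1 20 | awk '$1 > 10')
notB=$(seq 1 20 | awk '!($1 > 10)')
notnotB=$(seq 1 20 | awk '!(!($1 > 10))')
echo "$B"
echo "$notB"
```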

The original goal was to evaluate

A and B

As seen above this is identical to

A and (~)(~)B

and also from above that is now equivalent to

A *UNTIL/*SIT (~)B

Finally, (~)B will have the same result sets as a variant of B where the formula is reversed - say B_rev. So the following

A *UNTIL/*SIT B_rev is identical in function to A and B

You may want to work through some examples before continuing - in order to convince yourself.

Practical example

I titled this blog post thinking of two feuding clans - in reference to how hard it is to get two different multi-row attributes working together. However, by building a wall between them [BASE/UNTIL] and just referencing each other's presence we can achieve some valuable results.

There is a zip file attached with model Linux OS Agent situations which demonstrate this working ==>here<==. Following is a presentation of the model situations.

For this example, we may have a shell file installed in a directory /tmp/lpp and the shell file is run with this command "sh /tmp/lpp/testsl.sh". The goal is to have a situation event that fires if the command is installed but is not running.

Until Situation

First is the Until clause. The formula is against the Linux File Information attribute group and the test is whether the /tmp/lpp path is missing. When it is missing, the situation will be true and that will allow the base situation to be suppressed.

Base Situation

Next is the base situation which tests if the expected process is running. It uses the Linux Process attribute. The test is whether the process "sh /tmp/lpp/testsl.sh" is missing.

In the Advanced button we see Persistence is set to 2

And that DisplayItem is specified. Proc_CMD_Line happens to be the internal attribute name for Command Line. This is not strictly needed here, but is vital if more than one process is defined in the *MISSING clause.

Finally the Until tab

This is the linkage between the base situation and the until situation.

Limitations

DisplayItem cannot usually be set in the *UNTIL situation. APAR IV74758 - delivered in ITM 630 FP6 - can allow Base/Until DisplayItems in limited cases. This requires a manual TEMS configuration and precise knowledge that the two DisplayItems are in the same internal format.

Persist=2 must be set on the Base situation to avoid race conditions between base results and until results.

If the Base situation could return multiple results, DisplayItem must be defined so that multiple events can be created.

Summary

How to get two multi-row attribute groups to influence each other to gain useful information.

History and Earlier versions

If the current example situations do not work, you can try previously published binary object zip files. At the same time please contact me to resolve the issues. If you discover an issue, try intermediate levels to isolate where the problem was introduced.

Sitworld: Discovering Historical Data Export Problems at Agent

Introduction

ITM 6 has a marvelous ability to collect historical data. Best practice is to collect the historical data at the TEMA or agent and then export the data to the Warehouse Proxy Agent, which forwards the data to the data warehouse. With a large number of agents almost anything can go wrong and require fixing. Identifying the problem cases has been challenging. A few years ago a new TEMA attribute group was added to expose the last export status. This can be used in a situation formula to generate situation alerts for problem cases. This post shows exactly how to do that. An appendix at the end lists all the current status codes. In some cases you can resolve the problem yourself; in other cases IBM Support will be involved.

Step by Step Situation Development

Right click on a TEP navigation node such as Linux OS under a test Linux system. Select Situations... and click on the new Situation action. Enter a situation name in the dialog box. If the default does not match what you want, also enter the Monitored Application.

Click OK. For the first experiment set the test to be == 0, meaning alert when things are working as expected.

Click on Advanced,Display Item and select Collection Identifier.

Now make sure the situation is distributed to your test system and OK out. The situation should start immediately and in the Situation Event Console you will see

For the next steps you will likely want to test for Last Export Status not equal to zero. Next you will expand the distribution to more agents, like all Linux agents.

Last Export Status - what it means

The appendix has a list of all currently known export status values. 0 means success and can be ignored for this purpose.

One common one is 26

CTX_MetafileNotfound, 26

In general this means that historical data collection was configured, but no data was ever collected. The example studied closely was a Linux LPAR attribute group; the Linux system being looked at did not have any LPAR capability. For this site, we recommended that the attribute group not be collected. Another way to avoid alerting would be to extend the formula to exclude the 26 status.

Other errors may point to obvious conditions - like an inability for the agent to connect to the WPA or a nearly full mount point. In any case you need to investigate and resolve... with IBM Support if needed.

Other possibilities

Another common issue involves unhealthy agents - online but not responding and not running situations. Here is a blog post and program to help track them down: