Sitworld: Table of Contents

Inspiration

As the number of blog posts increases, it becomes harder to locate the ones of interest.

The first section lists six posts I consider most important.

The second section lists all the posts with very short comments.

Top 6 By Importance [My Prejudiced View]

ITM Database Health Checker
It is common to see a TEMS database [often called EIB] with problems that cause confusion or sometimes a lack of monitoring. This project identifies and documents 50+ advisories which will make things better.

Best Practice TEMS Database Backup and Recovery
The most costly support cases are when a customer does not have a proper backup. One memorable case was after a Storage Area Network device lost power and the most recent backup was over a year old. I talk to people every day where TSM is used to make copies of the TEMS databases, and that is almost every time insufficient. This post was written jointly by a top L3 engineer and myself. If everyone did this, the time to recover would drop substantially.

MS_Offline – Myth and Reality
MS_Offline type situations are extremely weighty and cause problems "at a distance". For example, a recent case with 9545 agents and 22 MS_Offline situations on a 5-minute sampling interval spawned multiple IBM Support interactions. They all come back to this one issue. When Persist>1 is set, the problems are much worse. The blog photo shows a California Condor [VERY LARGE VULTURE] lurking outside a window. Treat MS_Offline type situations as dangerous creatures and you will reduce your risk of injury and pain.

TEMS Audit Process and Tool
This has been available for 4+ years. It is a perfect way to examine the dynamic impact of workload [Situations, SOAP, real time data requests, etc.] on a TEMS. With that knowledge you can make changes to avoid problem conditions. I have one customer who runs this on every TEMS each weekend and, if advisory messages are present [noted via a non-zero exit code], sends the report to an analyst for review. The rate of emergency IBM Support meetings has dropped to near zero... at least for this area.

ITM Agent Health Survey
This tool provides a view of agents which are online but possibly non-responsive. Cases like this mean that real time data response is slow or partially missing, situations are not running, and historical data is not being recorded. These are things everyone should worry about. This identifies the guard dog that doesn't bark.

ITM Situation Audit
This tool performs a static analysis on all distributed situations and produces a report of warning messages. It also reports which situations need TEMS filtering [instead of Agent filtering], which is a prime performance killer. Together with TEMS Audit you can really increase efficiency, reducing the cost of monitoring. It also gives early warning for situations with problems. Surprisingly, 50 of 51,000 situations studied actually had syntax errors - like VALUE instead of *VALUE. Anyway, I expect this to be an important tool over time.

The following information has been updated and included in the new TEMS Audit distribution which is documented here.

Inspiration

People mostly ignore TEMS performance issues until the TEMS crashes or agents start going wildly online and offline. Until 2010 this was a constant challenge and issues would linger for months. At the end of 2010, I worked a perfect example. The customer had a single situation that – as it happened – caused 6 AIX servers to run at 95% utilization or higher. I saw from a diagnostic trace what situation was involved and why. An ad hoc summarization program demonstrated that 93% of the incoming workload came from that one situation – on each of the six AIX systems.

To help explain this to the customer, I exported the data from the analysis program into a comma separated file. That way the customer could view a spreadsheet representation of the impact. The customer agreed to cut the situation into 3 pieces such that the sum of the three situations had the identical effect. The result was that utilization on all 6 AIX systems dropped to 10% or lower… not by 10%, but from 95% down to 10%!! This was a case of a Too Big situation - where the WHERE clause was too large to send to the agent.

A month later I took on a long running issue. The summarization program pointed to one situation. That situation was alerting on SAP documents that were late in being processed. The situation itself was quite reasonable. The customer told me they had 10 SAP servers and it seemed that one was very overloaded and another was somewhat overloaded and eight others were under-loaded. The overloaded SAP servers had thousands of late documents. When they reconfigured the SAP workload the number of late documents decreased and the associated TEMS load was almost eliminated.

There have been several years of experimentation and improvement. The technote contains a full description of the process and a Perl program to perform the analysis. Changes introduced over time were in response to specific customer cases or to performance concerns.

A number of customer and IBM sites use TEMS Audit regularly to identify issues before experiencing a high impact.

On 20 May 2013 I published version 1.00000. That version calculates the diagnostic log segments on the fly and copies them to a work directory to handle rare cases where segments are reused. With this level, you could run this periodically using a crontab task [Linux/Unix] or an AT command [Windows] and get a regular report on upcoming issues.
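For example, a weekly Linux/Unix crontab entry might look like the following [the install path and the temsaud.pl program name are assumptions about a typical setup]:

0 6 * * 0 cd /opt/IBM/ITM/tmp/temsaud && perl temsaud.pl >temsaud.run.log 2>&1

A wrapper script can test the non-zero exit code to decide whether to forward the report for review.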

On 27 July I published version 1.05000. See below for a list of improvements. The trace report was added to measure a customer TEMS environment that was accidentally set up to run with maximum tracing continuously.

On 12 August I published version 1.10000 with the Advisory section added.

I suggest everyone get this package and run it regularly. Everyone prefers to run without a crisis. Smooth running is best for everyone. If you have any interesting ideas on how to extend this work, I am always interested. Add a comment here or send an email.

The diagnostic log showed what was happening. Most cases were perfectly clear although there were surprises. For example *STR was Agent filtered but *SCAN was TEMS filtered. However *MIN and *MAX did not create a situation event, even though rows were returned to the TEMS. That was very puzzling.

Background

The idea is to examine all Linux processes with Cumulative Busy CPU above zero, select the one with the minimum such value and create an event based on that data. We will discuss later whether that can produce anything interesting. However from the workload trace there were 6 rows of data returned. Clearly one had to be a minimum but there was no event created.

I retested on an ITM 623 system and found a case where an event was created but the data was not right. The minimum attribute was displayed correctly, but looking at all rows, the rest of the data came from a non-minimum row. The Process ID, the command name, etc. were all wrong.

Defect or design problem?

I talked with our chief dataserver developer guru and he told me that area had not changed substantially since the Candle acquisition in June 2004. In looking closely at the situation he spotted a fundamental problem in the actual SQL.

Before the agent can work with a situation, the predicate must be converted to a PREDICATE, or SQL, that represents the needed logic. That is kept in the SITDB table. Here is what the test SQL above looks like.

The columns represent all the attributes which are captured when the situation formula is true [see History Note 1 below]. All the names are in Table and Column form and not the attribute names. You can find the correspondence in the Object Definition Interface [ODI] file which the portal client uses. This would be the docklz file, since this is the Linux OS Agent.

The SYSTEM.PARMA is how the TEMS lets the agent know Situation information like the name etc.

The last two lines are supposed to represent the *MIN logic. The GROUP BY doesn't make much sense; it is supposed to be limited to columns which have been grouped before. In this case the ORIGINNODE or agent name is the same for all rows, so all the rows are to be examined. The HAVING clause in SQL can only reference the grouped value. This would be legal:

HAVING MIN(KLZPROC.CPUSECONDS) > 10

But the comparison to the current row is totally illegal SQL. At that point you only have grouped rows, so you cannot compare row by row as this supposedly does.
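As a hypothetical illustration [the actual generated SITDB text was shown above as a capture], the illegal form compares a grouped value against an ungrouped per-row column, something like:

HAVING MIN(KLZPROC.CPUSECONDS) = KLZPROC.CPUSECONDS

The right side references the current row, but after GROUP BY only grouped results exist, so the comparison has no defined meaning.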

Further Testing

Our Dataserver guru created a table in the TEMS, ran some SQL to populate the table, and then ran SQL parallel to the above. Sure enough, depending on the initial row contents you could get 1) no rows, or 2) rows with bad data, or 3) sometimes rows with the right data.

The firm conclusion was that this was a design defect, but how to proceed was controversial.

Next Steps

The initial thought was that this defect should be corrected. However practical concerns weighed heavily. We would need to add a non-SQL quirk to the dataserver. That change could be expensive, since a two pass logic was required. That was a new area for the dataserver and few were comfortable with predicting how much time would be required to develop the change. There was also the serious risk of affecting existing SQL usage.

A second question was how common the usage was. Here came the biggest surprise. For the Situation Audit project I could add logic to count such cases. Rechecking over 50,000 situations I found just one – not production – situation. That came from a PMR in July 2013 where the customer discovered it wasn't working. In 2014 I worked with a second client that wanted to use it and experienced problems. That is a very small number of potential users.

Question on Usefulness and Efficiency

How useful would such a situation be? It is hard to make a case for usefulness. The data returned would be "on all processes the one with maximum CPU usage is xxxxx". That might be valuable for a time-sharing environment but that is rare these days. The same data can be gotten by a Portal Client workspace view on demand if it is just for interest sake. Such a situation might be interesting but not perhaps something you want a problem ticket issued on.

How efficient would such a situation be? In most cases, the situation will be true every time… because something will be minimum or maximum. Since it is TEMS Filtering, many rows will be sent to the remote TEMS on each cycle. Imagine that happening for hundreds of agents every couple of minutes. This could severely impact the TEMS. Other such cases have destabilized the remote TEMS and caused outages.

In the end we decided unanimously that we should just document the current condition and leave it alone. I did add it as a Situation Audit advisory message as a warning.

Summary

History Note 1:

You won't always see all the attributes. Each attribute has a property called "cost". The maximum cost of the attributes used in the formula is used to limit the attributes asked for to attributes with that cost or lower. This was because in the initial OS/400 implementation [January 1997] some of the attributes took many minutes to retrieve. To achieve a timely response, we avoided getting them automatically unless the situation formula called them out.

This is mostly ancient history since most agents set the cost at zero. However the LocalTime attribute group has an attribute Time with a cost of 9. If you set a formula for Hours *GT 00 then the results will NOT include the Time attribute. I suspect if you closely examined the ODI for the iSeries agent you would see several examples also.

Sitworld: A Situation By Any Other Name…

Inspiration

A customer was attempting to verify a new situation was working as expected. The situations were not complex and he was reviewing the TEMS and Agent operation logs to see that situations were starting and were firing as expected. The operation logs were not showing what he expected.

Background

Situations have one or more names depending on the context where they are viewed. The simplest case is when a situation name is composed under the short name rules - which everyone should memorize:

1) 31 characters or less

2) Alphanumeric or underline

3) First character alpha

4) Last character not underline

If you are a regular expression geek, rules 2/3/4 would be expressed as

^[A-Za-z][A-Za-z0-9_]*[A-Za-z0-9]$
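As an illustrative sketch in Perl [the helper name is hypothetical and not part of any tool], the full check including the length rule could be written as:

sub valid_short_name {
    my ($name) = @_;
    return 0 if length($name) > 31;                        # rule 1
    return $name =~ /^[A-Za-z][A-Za-z0-9_]*[A-Za-z0-9]$/;  # rules 2/3/4
}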

Before ITM 6.2, the rules were almost the same except rule 1 was 32 characters or less. That was the only choice for situation names.

If you create a situation under shortname rules and never rename a situation by over-typing in the Situation Editor then that is the name you will see in most circumstances. This is preserved in the TSITDESC table and is the SITNAME column.

From ITM 6.2 [14 November 2007] a new situation feature was introduced called FULLNAME. This allowed names up to 256 characters that violate rules 2/3/4 - for example, names with dashes or embedded blanks. To implement this new logic, the TNAME table was introduced with two columns:

FULLNAME - name up to 256 characters
ID - index name

The ID column follows the short name rules. The TSITDESC.SITNAME column is always the same as the TNAME.ID field.

When FULLNAME is seen and not seen

The FULLNAME is always seen in the Situation Editor. If it is blank or missing, the SITNAME is used.

Both SITNAME and FULLNAME are seen in the tacmd viewsit and tacmd bulkexport outputs. In simple cases, FULLNAME will be blank.

In the TEMS operations log and in the Agent operations log, the SITNAME column is the only one ever seen.

Important note: If you need to validate cases where there is a FULLNAME and an index SITNAME, you will need to get the TNAME data to translate.

The Rename Horror Show

When you have an existing situation and rename it in the Situation Editor by overtyping the name and pressing Apply, the TSITDESC.SITNAME is never!! ever!! changed and a TNAME entry is created with the new name in FULLNAME and the old name as the ID.

In this worst case scenario the situation name seen in the TEMS and Agent operations logs would be the opposite of the one seen in the Situation Editor. That is a prime source of confusion.

If you must rename a situation but want to avoid this confusion, do a Create Another... and type in the desired name. Later delete the old name.

Typical SITNAME Index Names

If you enter a long name in the Situation Editor and the first 31 characters match rules 2/3/4, then the SITNAME index will be just the first 31 characters. If a situation with the same SITNAME index already exists, a Z name is created as described next.

If the SITNAME index needs to be created, it typically has the form Z and 30 numeric characters. The exact one chosen is somewhat random since the first one chosen might already exist.

FULLNAME and tacmd createsit

Some people do a tacmd viewsit -s sitname -e sitname.xml. Then they manipulate the xml file and reload it with a tacmd createsit -i sitname.xml. As you can see from the above logic, it is very important to use the Shortname rules or to keep the SITNAME and FULLNAME entries unchanged. The tacmd createsit process does not create new valid SITNAME index values.
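For example, with a hypothetical situation named MY_SIT, the round trip looks like this:

tacmd viewsit -s MY_SIT -e MY_SIT.xml
tacmd createsit -i MY_SIT.xml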

The _Z_ Situation Names

TEMS implements a feature called Super Duper Situations. In this usage, existing situations are merged into a _Z_ type situation. That situation is run, the results are returned to the TEMS and the TEMS creates results for the original situations. In the operations log what you see is the somewhat mysterious _Z_ names. See this post Sitworld: Mixed Up Situations for a fuller explanation including how to disable that logic… which can be very useful for debugging some problem situation cases.

Results

The client deleted the problem situations and created them again following the shortname rules. Then he could validate the situations were working as expected.

Summary

This document discusses the complexity of situation names in various contexts.

Sitworld: Action Command Wars – A New Beginning

Inspiration

A customer hub TEMS was unstable, crashing a few hours after startup. The user had created a MS_Offline type situation without the critical Reason != FA test and had an action command that sent an email. See this post for gory details about MS_Offline. That situation was turned off, but a week later they reported another incident of instability. The time to failure improved from several hours to several days but was still very bad. The environment needed a fresh look.

Background

I reviewed the situations running at the customer site. There were about 850 situations and over 200 had email action commands. I asked the customer about that and there was no event receiver installed like Omnibus. Email was the method used for alerting IT staff of conditions to be looked at. It was a relatively small environment and all emails were sent from the hub [and only] TEMS. By reviewing the TEMS operations log I could see there were occasional bursts of 50-100 emails in 5-10 seconds.

Action commands run simultaneously in the TEMS [or Agent] process space. That means they consume both process space and CPU time. If there is a large enough burst of commands, that takes a lot of CPU and process space. As a direct result the TEMS can fail.

Early Solutions

To recover from a case where a situation with action commands was crashing the TEMS on startup, I documented the Running TEMS without SITMON technique. With SITMON not running, you can run some SQL using kdstsns to change the problem situation to not run at startup.

The Simple Solution

As anyone who reads this blog knows, I have been surfing through action commands recently. In one recent post I ran an action command in a different process. The reason there was to allow the command to continue while the agent stopped and started. Reflecting on that case, a simple solution became clear: run the email commands as separate processes.

In each case the command will be run as a separate process. That means that the TEMS process space will not suffer a catastrophic increase in size and potential failure.
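A minimal sketch of the idea, assuming a site-provided mail script [the script name and address are hypothetical]:

(/opt/scripts/send_alert_email.sh "TEMS alert" ops@example.com &)
==> the ( … ) sub-shell with the trailing & runs the email command as a separate process, outside the TEMS process space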

This is not a perfect solution. For example, there is no way to send a "condition is resolved" email. A proper event receiver like Omnibus is always preferred.

Possible Serious Side Effects

You might experience performance problems or worse.

If there are 1000 agents connected through a remote TEMS and that remote TEMS goes offline, there are 1000 agents declared offline immediately. If there is an email action command on a MS_Offline type situation, that would mean 1000 processes starting at the system running the TEMS. If the system paging space is not configured to handle the new worst case virtual storage usage, a system failure might occur. Any situation author might create a situation which accidentally creates a lot of events. If there is an action command, that would mean many commands running at the same time.

If the system has a limit on the total number of processes running, that might cause some action commands to not run.

The action commands would all use some CPU. That might temporarily cripple the TEMS and cause unpredictable bad effects.

Summary

This post shows how to run action commands outside of the TEMS process and thus avoid cases which might otherwise cause a TEMS failure.

Sitworld: Adding Environmental Data to Action Command Emails

John Alvord, IBM Corporation

Inspiration

A customer situation was created to detect a dangerously full paging space condition on an AIX system. The formula used the KPX Paging Space attribute and the test was like this (Used Pct > 90).

A situation action command created an email sent to operations staff. They monitored such events by reading email on a smart phone. The email did not contain enough information to decide how to handle the issue. The attributes were system wide instead of specific. The smart phone did not have a remote terminal program to log on to the system.

The customer needed to send environmental information in the body of the email.

Solution

Here is an action command to gather additional information in this case. It only applies to AIX since the svmon command is specific to that Unix version. The general scheme can be used in many more circumstances. The action command uses Linux/Unix/Windows meta characters like ( … ) to create sub-shells. Some of the information comes from the Agent attributes and some from the environment. The command is one long string. Don't get scared because a careful explanation follows.
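Since the original command was shown as a capture, here is a hypothetical AIX sketch of the same idea [the mail address is an assumption]:

( echo "Paging space alert" ; svmon -G ; ps aux | head -10 ) | mail -s "Paging space alert on `hostname`" ops@example.com
==> the ( … ) sub-shell gathers svmon and process data and pipes the combined output into the email body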

Sitworld: Agent Workload Audit

Draft #4 – 28 June 2015 - Level 0.82000

Inspiration

A large customer had achieved considerable stability and efficiency in the TEMS and TEPS.

They insisted the agents use fewer resources. In some cases agents were consuming 15-25% of a single core or more. Most of the activity was driven by situation processing; however, the TEMS Audit report showed a light result workload arriving at the TEMS from the Agent.

To diagnose this condition, the Agent Workload Audit tool report summarizes activity at the agent. The goal is to measure Agent side processing. Just as with TEMS Audit you can focus on specific cases. You can make changes and test the results. If a specific situation can be implemented more efficiently, this is a good way to test. If you want to sample every 2 minutes for a rarely occurring condition, the value of that testing can be balanced against the cost of doing so. In any case the added information reduces uncertainty and increases satisfaction.

Level 0.80000 adds Pure Situation reports. Pure situations [typically checking a log] can have an extremely high but hidden impact on an agent. In the key case a client had accidentally configured very high volume logs as well as the error logs. The result was an agent that could not keep up with the input. Since the situation conditions never hit, there was no indication by events or anything else except the agent failure. The new report shows how many rows of data are evaluated and how many are passed on to the TEMS.

Level 0.81000 adds support for old TEMA levels. Some data is not available but the reports are still highly useful.

ITM Agent Workload Audit Installation

The Agent Workload Audit package includes one Perl program agentaud.pl. It is contained in a zip file here. The program has been tested in several environments using data from other environments. Windows has had the most intense testing. It was also tested on Linux. Many Perl 5 levels will be usable. Here are the details of the testing environments.

1) Perl on Windows

This is perl 5, version 16, subversion 3 (v5.16.3) built for MSWin32-x86-multi-thread (with 1 registered patch, see perl -V for more detail)

2) Perl on Linux on Z

# perl -v

This is perl, v5.10.0 built for s390x-linux-thread-multi

Copyright 1987-2007, Larry Wall

ITM Agent Workload Audit Configuration

The Agent Workload Audit package has controls to match installation requirements but the defaults work well. All controls are in the command line options. Following is a full list of the controls.

The following table shows all options. All command line options except -h, -ini, and three debug controls can be entered in the ini file. The command line takes precedence if both are present. All controls are lower case only.

command    default        notes
-z         off            Log is RKLVLOG from z/OS agent
-o         agentaud.csv   Report file name
-h         <null>         Help messages
-v         off            Messages on console also
-cmdall    off            Report all commands
-nohdr     off            Do not print report header lines
-objid     off            Include process object ids in report
-logpath   off            Path to log files

The remaining parameter is the log file name specification. That can be a single file or a partial diagnostic file name. For example, if an example diagnostic log name is nmp180_lz_klzagent_5421d2ef-01.log, the filename specifier is nmp180_lz_klzagent_5421d2ef.

The diagnostic log segments wrap around in a regular pattern. The Agent Workload Audit calculates the correct analysis order. In some cases that order is incorrect and a manual collection must be created. This usually shows when values in the report have a negative time value.

Note: The -z option for z/OS agent logs will be validated soon. You are welcome to try it now and if there are issues please contact the author. The basic logic has worked "forever" in TEMS Audit but testing is always an important step.

ITM Agent Workload Audit Usage

The Agent must be configured to capture the needed diagnostic information. See Appendix 1 below for gory details.

Linux/Unix Agent Configuration

In the <installdir>/config/XX.ini file [where XX is the agent product code, for example lz for Linux] add the following KBB_RAS1 trace specification [shown here as a single long line]

KBB_RAS1=error(unit:kraafmgr,Entry="ctira" all er)(unit:kraafira,Entry="DriveDataCollection" all er)(unit:kraafira,Entry="~ctira" all er)(unit:kraafira,Entry="InsertRow" all er)(unit:kdsflt1,Entry="FLT1_FilterRecord" all er)(unit:kdsflt1,Entry="FLT1_BeginSample" all er)(unit:kraadspt,Entry="sendDataToProxy" all er)(unit:kraatblm,Entry="resetExpireTime" all er)(unit:kglhc1c all er)(unit:kraaevst,Entry="createDispatchSitStat" flow er)(unit:kraaevxp,Entry="CreateSituationEvent" detail er)

Windows Agent Configuration

error(unit:kraafmgr,Entry="ctira" all er)(unit:kraafira,Entry="DriveDataCollection" all er)(unit:kraafira,Entry="~ctira" all er)(unit:kraafira,Entry="InsertRow" all er)(unit:kdsflt1,Entry="FLT1_FilterRecord" all er)(unit:kdsflt1,Entry="FLT1_BeginSample" all er)(unit:kraadspt,Entry="sendDataToProxy" all er)(unit:kraatblm,Entry="resetExpireTime" all er)(unit:kglhc1c all er)(unit:kraaevst,Entry="createDispatchSitStat" flow er) (unit:kraaevxp,Entry="CreateSituationEvent" detail er)

For the log segment size, specify 25 - meaning 25 megabytes.

z/OS Agent Configuration

This is not tested yet. If you are interested please contact me.

Usage

Recycle the agent and let it run whatever workload is to be tested for an hour or two. Collect the current diagnostic log segments to a work directory and then run the report. For the log segments above, this would be an example:

perl agentaud.pl -v nmp180_lz_klzagent_5421d2ef

The result will be a file agentaud.csv which contains the report.

Agent Workload Audit Sample report

Here is a screen capture of an example report. Some columns of lesser interest were hidden to make it easier to show here. Following are some comments

If a situation is *REALTIME that means it was a real time request from a TEP workspace view or a SOAP request or some other source.

What to do with the Report

Once you have a report you can see which collections are impacting the workload. In the above case the top situation was defined like this:

Another way is to make some preliminary checks before the expensive wild card checks. You would need to check on typical values of Base Command first, but the following would make an even bigger difference:

In the first stage the comparison is very quick and entirely at the agent. [Avoid *SCAN which forces TEMS filtering]. Assuming the program is running, the *MISSING stage only runs once instead of against 100 or maybe even 1000 processes in a large unix environment. If the program is not running, the *MISSING never compares a single process.

After making changes, you can retest and measure the benefit.

Summary

History and Earlier versions

If the current version of the Agent Workload Audit tool does not work, you can try recently published binary object zip files. At the same time, please contact me to resolve the issues. If you discover an issue, try intermediate levels to isolate where the problem was introduced.

0.82000
Correct capture of situation name in one case. Add several new known pure event tables.

Appendix 1: TEMA Workflow and Diagnostic Trace Gory Details

This comes from research on Agent data collection processing. All the logic traced is in the TEMA [Agent Support Library] component ax [Linux/Unix] or gl [Windows]. TEMA mediates between the TEMS and the data gathering function and knows about situations, historical data collection, and real time data collection. Here is the diagnostic tracing needed to track the process.

Starting a collection process

At this point an Agent Workload Audit object is created to track future activity.

Objectid - 1353712387,2226127832 - the first number represents a processing object and the second number represents the TEMS the agent is connected to. The combination of values is a unique identifier for the processing object. The Agent Workload Audit [AWA] process maintains a separate instance for each such object.

Table - KLZ.LNXCPU - The Application name, KLZ for Linux OS Agent and LNXCPU for the CPU Attribute group. Each request is for a single attribute group.

Situation: Linux_High_CPU_Overload - the situation name. If absent, this represents a real time data request.

You refer to the Objectid when understanding TEMA traces. For example if an agent is connected to one TEMS, that TEMS is recycled and the agent switches to another. At that point you will see the same situation starting for the new TEMS and stopping at the old TEMS but the ObjectId is changed.

Start of Data Collecting

Each cycle a collection process timer expires. There is a calculated time for the next collection and processing has reached that point. In reality there may be a delay between planned expiration time and actual expiration time. That is a signal of an over-committed agent workload.

Note: Most of the next trace entries do not have an object id. In this case the internal pthread id, which is part of the trace as the F here: 5422F619.1C20-F:, is used as a temporary link between the trace entry and the objectid. That linkage is broken when the data retrieval is complete. For technical reasons it is stored with a leading T. This example represents a thread key TF.

Start of data sampling

After the data provider has supplied the data, filtering takes place. Filtering follows the Situation Formula rules to figure out which rows of data are needed. This is sometimes useful in measuring data provider times.

Filtering data

As each row is filtered a zero return means the data passed filtering. A one return means the row did not pass filtering and is therefore discarded. This is very useful to see cases where a large amount of data is retrieved and discarded - which can be a prime indicator of inefficient processing.

Saving data for the TEMS

When a row is accepted for sending to the TEMS, this trace is seen. The goal here is to capture the row size in bytes. If all rows are filtered away, this is not seen. You see more than one if more than 50 rows are to be returned.

This trace summarizes the data collection process. The timeTaken is the measured elapsed seconds for the data collection. This is critical for estimating relative workload impact. The next expire time is the planned next collection time. If the next collection starts late, that is another signal of overwork at the agent.

Note: This diagnostic trace is not available at early TEMA levels and so not everything will be reported.

Sitworld: Alerting on Daylight Savings Time Truants

Introduction

Twice a year installations need to change the clocks on many systems. Identifying which systems did not have their time changed is often of great interest. This document shows you how to create a situation which will alert when an agent is running with a system time that is very different from the hub TEMS system time. Based on those events you can undertake recovery actions. This can also be useful to identify cases where a time protocol is in use but has been badly configured and is not working as expected. I had one case where the system involved had a broken hardware timer.

This is a model situation and not suitable for production use. For example, if an agent is in another timezone, it would be normal for the system time to be different. During a DST transition, you might get many alerts that could be ignored. If each agent were in the same timezone as the TEMS it reports to, no extra alerts would be seen.

One important aspect of *TIME tests is that the two items tested must be in different attribute groups. If you have both sides in the same attribute group, the situation fires or does not fire somewhat randomly. That is an undocumented part of ITM and the Situation Audit project warns about it.

Step by Step Situation Development

For the attribute to test, select a single row attribute group specific to the operating environment. Here are good choices for Linux/Unix/Windows:

The test situation will be named IBM_time_test. Start with a situation on the target system and select Attribute Group Linux Machine Information and attribute Timestamp.

Next click OK and start to develop the situation

Click on column function area [left third]. The following dialog box pops up

Click on Cancel

Click on column function area and select “Compare Time to a time + or - delta”

In the selection choose Local_Time.Timestamp.

Set the delta time and select OK. I will leave the settings at + one hour for this example situation formula.

Apply and test the situation.

You will need one situation for an agent ahead of the TEMS time and another situation for an agent behind the TEMS time.

The formula advanced display shows

( TIME(#LNXMACHIN.TIMESTAMP) > '#'Local_Time.Timestamp' + 1H')

This compares the Agent side Linux Machine Information Timestamp with the TEMS side Local_Time. For example if you wanted to alert when the agent was 30 minutes ahead of the TEMS, you would do

( TIME(#LNXMACHIN.TIMESTAMP) > '#'Local_Time.Timestamp' + 30M')

Suppose Agent is at 1:35 PM and TEMS is at 1PM. Left side Agent is 1:35pm, right side TEMS is at 1pm, TEMS timestamp + 30 minutes is at 1:30pm. Test is 1:35pm more than 1:30PM – which is true and so situation event fires.

If you wanted to alert when the agent was 30 minutes behind the TEMS.

( TIME(#LNXMACHIN.TIMESTAMP) < '#'Local_Time.Timestamp' - 30M')

Suppose Agent at 12:55 PM and TEMS at 1:30PM. Left side Agent is 12:55PM, right side TEMS is at 1:30pm, TEMS timestamp - 30 minutes is at 1:00pm. Test is 12:55 pm less than 1:00PM – which is true and so situation event fires.

Note that if the TEMS and the agents involved are in different time zones, the formula would need to adjust.

Summary

This shows how to create situations that alert on agent DST status. I decided not to include the model situation for simplicity.

Sitworld: An Efficient Design for Starting a Background Process

Inspiration

A customer wanted to start an agent related background task. The Linux/Unix systems involved had an OS Agent installed but the customer did not have system administrator rights. They could not login and add things like crontab tasks. They could transfer scripts using either tacmd putfile or using a remote deploy process with a non-agent package.

Rejected Solution

The first design was to have an always true situation, for example a formula like

LocalTime.Time <250000

The Local Time attribute is available in most agents and certainly in OS Agents. The sampling interval would be long, like 999 days. The action command would start the process in the background using a trailing ampersand. The action command would be the same one as in the next section.

The performance problem was that a situation event would be created, potentially for 20,000+ agents. Even if it was hidden from the Portal Client display [not associated] and not sent to an event receiver like Omnibus, a record had to be kept, and several TEMS database tables would be a lot larger in that case. The TEMS process size would be larger and SITMON processing would be slowed a bit. The memory issue is probably more important than the CPU use issue.

Efficient solution

The sample program objects discussed below are here. The solution uses a workflow policy [Edit/Workflow Editor...]. That will be new to many, but it is a standard part of the Portal Client and works well. Don't be afraid of the unknown!! This specific case is the simplest possible - two activities with a single link. The workflow policy export is part of the example program objects.

This efficient design uses a situation IBM_start_ibmon123 which is not auto started and an auto-started workflow policy IBM_policy_start_123 which runs one time [per agent connection]. The Situation sampling interval is set to 999 days. The situation distribution is to all the agents where the background script should be running.

When a workflow policy waits for a situation true result, the situation is started at the agents according to the situation distribution but only for evaluation and delivery of results to the workflow policy. The Policy distribution is set to the same as the agent. [The Policy distribution can include more agents but the policy activity is driven only by the incoming situation results]. This policy is correlated by managed systems, policy auto start is on and restart is off.

In this usage, the situation will never create an event [unless otherwise started as a situation]. Any situation action command is ignored. It only returns results according to its sampling interval. Since the sampling interval is 999 days, results are sent to the TEMS when started and then not again for 999 days, so the performance impact on the TEMS is minimal. There is a minor cost of maintaining the situation objects at the Agent.

The performance saving: no events are created or stored at the TEMSes.

Here is a screen capture of the workflow policy editor.

A typical workflow policy waits for a situation event and then a link to one or more activities – in this case a Take Action activity. This sort of action command is limited to 255 characters. By comparison a situation action command run at the agent is limited to roughly 440 characters.

The action command in this case has three requirements

Test for process already running, if yes exit

Test for script present in the expected file location, if not exit

Start script in the background.

For completeness add a *MISSING process situation running to identify the cases where the background process is not running… perhaps because the script was not yet loaded into that environment.

Here is the example action command and then a detailed explanation. In this example the process name includes “ibmon123”.

(perl $CANDLEHOME/tmp/ibmon123.pl &)
==> run the Perl program in the background [trailing &]
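Putting the three requirements together, a hypothetical sketch might be [pgrep availability is an assumption]:

/bin/sh -c "pgrep -f '[i]bmon123.pl' >/dev/null && exit 0; [ -f $CANDLEHOME/tmp/ibmon123.pl ] || exit 1; (perl $CANDLEHOME/tmp/ibmon123.pl &)"
==> exit if already running; exit if the script is absent; otherwise start it in the background. The [i] bracket keeps pgrep from matching this shell's own command line.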

Windows Comment

There was no need to work out a solution for Windows. If you want to experiment, here is how to start a command in the background.

start /min cmd /c perl $CANDLE_HOME\bin\ibmon123.pl

$CANDLE_HOME is the Windows environment variable for installation path.

Workflow Policy Notes

The Policy receives results and not situation events. If you want to model the logic of a situation event, then the layout would look like this:

<wait for sit true> --> <wait for sit false>
         |
         V
<action command>

That is not important here since the sampling interval is 999 days. However if a situation sampling interval is 5 minutes, a series of true results will end up driving a series of action commands. If you want just one command, you need to wait for the sit false case. If you want a series of commands then leave it out.

The policy Take Action command ends when the command successfully starts. For example if the command had a sleep 120, the command itself would not actually complete for two minutes plus the other command time. That is important if you need to perform two commands in a coordinated way. For that case you could follow the first action command with a Take Action Delay for maybe 150 seconds and then the second command all connected together.

In your policy testing, it is very useful to review the Agent Operations log, e.g. <agentname>:KUX.LG0 on Unix. You will see the command being performed and the start status code. Status 0 is good and non-zero is some failure. That is NOT the command exit code. Usually you need to build out complicated commands slowly. Use the echo command to write to a /tmp log file to watch results. On Linux/Unix, echo $? will show the most recent exit code. It takes time to thoroughly test a complex action command in line mode and then in an action command. Even the one above - which looks pretty easy - had two errors during development. Slow and careful wins the day.
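As a hypothetical illustration of the echo technique [the log file name is arbitrary]:

echo "ibmon123 step1 rc=$?" >> /tmp/ibmon123_test.log
==> appends the most recent exit code to a scratch log you can watch with tail -f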

A final note: the workflow policy is sensitive to DisplayItem. If multiple results are returned, they are all handled in separate policy threads.

Summary

This post showed how to start a background task in Linux/Unix using ITM facilities.

Kudos to Bernie Garness at Mayo Clinic who called attention to this area of inefficiency.

Sitworld: Attribute and Catalog Health Survey

Draft #5 – 12 February 2016 - Level 0.83000

Inspiration

Recently I worked with a customer that ran into a rarely seen ITM limit. ITM uses catalog and attribute files to define the data that agents can process from their monitored environments. The TEMS reads the catalog files into a combined catalog table and the attribute files into an in-storage attribute collection. These get used in Situations, Historical Data, Real Time data displays and more. This customer had added the 513th catalog file and the TEMS failed during startup. Internally, .cat files are known as package files and there is an absolute limit of 512 packages. With IBM Support help, the customer removed a few .cat and .atr files, reset the combined catalog file to empty, and the TEMS started up just fine.

However this meant the customer was unable to install certain types of maintenance or new applications. There was an urgent need for a reliable way to identify unused catalog and attribute files.

The result is this package, which calculates the unused catalog and attribute files. It also produces a health report which identifies error cases, such as an attribute group used in a situation but missing from all attribute files.

The Situation definition is taken either directly from the TEMS database tables TSITDESC and TNAME or indirectly from the Situation Audit project run with the -a option introduced at level 1.25000. Situation Audit data will provide a report that has fewer false advisories. The EIB tables used directly sometimes identify attribute group names incorrectly because they have approximately the correct form. Situation Audit is more precise because it performs a full syntax analysis. In actual usage either will do.

The first step will be removing unused catalog and attribute files. After that the number of advisory messages in the report will be sharply reduced.

The three data sources do not have to be used in place. You can create the data and afterwards copy it to another location for processing. You do not have to achieve perfection although removing the high impact advisories will definitely improve ITM processing reliability. Performance is not expected to change much.

Package Installation

This document uses the default install directory; however, you can use any directory wanted.

Linux/Unix systems come with Perl installed. Windows may need it installed and I use http://www.activestate.com/activeperl, community edition 5.20. No CPAN modules are needed for this package. It will likely work on many different levels. As time goes on the project will be upgraded to modern levels about once a year in the late fall.

1) A Perl program atrhealth.pl and control file atrhealth.ini - standing for Attribute and Catalog Health Survey.

2) If you use the Situation Audit capture of the sit_atr.txt file, the following files can be ignored.

3) A Windows atrsql.cmd file to run the SQL statements

4) A Linux/Unix atrsql.tar file that contains the atrsql.sh file. This avoids problems with line endings. To use untar atrsql.tar into the target directory.

5) The cmd and shell files require manual updating if the install directory is not the default.

I suggest all these program objects be placed in a single directory. For Windows you can create the tmp directory and sql subdirectory. For Linux/Unix create the sql directory.

Linux/Unix: /opt/IBM/ITM/tmp/atrhealth

Windows: c:\IBM\ITM\tmp\atrhealth

You can run this in any directory, of course.

Configuring the Attribute Health Survey Program - Initialization file

Create the atrhealth.ini file. Here is an example where the sit_atr.txt will be used.

sit_atr qa1\sit_atr.txt
attrlib atr
rkdscatl cat

sit_atr: the data supplied is the filename. In this case there is a sub-directory qa1 and the file is in that directory. This is from Windows and so the backslash character is used.

attrlib: the data supplied is a directory where all the attribute files are stored.

rkdscatl: the data supplied is a directory where all the catalog files are stored.

These can be specified as fully qualified file names to use the existing files like this

attrlib C:\IBM\ITM\cms\ATTRLIB

If the Situation data is supplied by the EIB capture, the atrhealth.ini looks like this [# is a comment character]

#sit_atr qa1\sit_atr.txt
attrlib atr
rkdscatl cat

The two EIB capture files must be in the current directory and have the names

QA1CSITF.DB.LST
QA1DNAME.DB.LST

and they should be identified automatically. If there is any confusion you can invoke atrhealth.pl with the -lst option.

Getting the Situation/Attribute Data

For the Situation Audit case install that package and use it with the -a option.

Following shows how to get the data from the EIB using the supplied SQL via the atrsql.cmd or atrsql.sh files. Here is an example where the work is being done in the existing default tmp directory for Linux/Unix where the TEPS is running. If the product is not installed in the default directory, set the environment variable as shown in the steps below.

Linux/Unix

a) copy atrsql.tar to /opt/IBM/ITM/tmp
b) tar -xf atrsql.tar
c) If not using the default install directory, configure like this: export CANDLEHOME=/opt/IBM/ITM
d) sh atrsql.sh
e) The two files are created and should be moved to where the survey will be done

Here is an example where the work is being done in the existing default tmp directory for Windows where the TEPS is running.

Windows

a) c:
b) cd c:\IBM\ITM
c) md tmp
d) cd tmp
e) move the atrsql.cmd to this directory
f) If not using default install directory configure like this: SET CANDLE_HOME=c:\IBM\ITM
g) atrsql.cmd
h) The two files are created and should be moved to where the survey will be done

Running the Attribute and Catalog Health Survey

Linux/Unix

a) Following the preceding step, the two files QA1CSITF.DB.LST and QA1DNAME.DB.LST are already present in /opt/IBM/ITM/tmp
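b) Run the survey from that directory; a typical invocation [assuming atrhealth.pl and atrhealth.ini are in the current directory] would be:

perl atrhealth.pl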

Screen shot of Attribute and Catalog Health Survey Report

The beginning of the report contains the version number and a count of the number of messages. That is followed by the advisory messages.

Following is the advisory message documentation.

Advisory code: ATRHEALTH1000E

Text: Attribute group name in sits[$sits] not found in attribute files

Severity: 100

Check: For every Attribute Group used in a situation, it should be defined in an attribute file.

Meaning: This is sometimes a false positive when using data directly from the EIB. For example if a Situation Formula contained "12.50" the first three characters might be mis-recognized as an attribute group. This does not occur when situation/attribute data is gotten from Situation Audit.

However if this is not the case, that means the situation will not be processed correctly.

Recovery plan: Install the needed attribute and catalog files and restart the TEMS [needed on all hub/remote TEMSes]. If the situation is no longer needed, delete it. If the situation is not autostarted, it could be ignored.

Advisory code: ATRHEALTH1001E

Check: For every Attribute Group there should be a related catalog file that defines the application and table name.

Meaning: This strongly suggests the attribute and catalog files are not installed correctly. It could mean that associated situations will not run correctly.

Recovery plan: Review the related attribute file and see what the catalog file should be. If necessary, reinstall the application support.

Advisory code: ATRHEALTH1002W

Text: Attribute group in fn[$pfns] unused in situations

Severity: 50

Check: For every Attribute Group defined in the attribute files, check if it is used in a situation.

Meaning: This could mean the attribute group and related catalog file are unused and can be deleted. However it might be an attribute group only used in TEP workspace real time views or where situations will be created in the future.

Recovery plan: Review the attribute files and delete attribute and catalog files if not needed.

Advisory code: ATRHEALTH1003W

Text: Catalog table in fn[$pfns] unused in situations.

Severity: 50

Check: For every catalog file determine if the related attributes are used in any situation.

Meaning: This could mean the catalog file and related attribute files are unused and can be deleted. However it might be an attribute group only used in TEP workspace real time views or where situations will be created in the future.

Recovery plan: Review the catalog files and delete attribute and catalog files if not needed.

Advisory code: ATRHEALTH1005W

Text: duplicate Attribute group in files [$pfns]

Severity: 25

Check: For every Attribute Group check for duplicates

Meaning: This is most often a remnant of Universal Agent or Agent Builder catalog files.

Recovery plan: Delete duplicate attribute files which are unused. This will avoid future problems with too many catalog/attribute files.

Advisory code: ATRHEALTH1006W

Text: duplicate Catalog files in files [$pfns]

Severity: 25

Check: For every Catalog file check for duplicates

Meaning: This is most often a remnant of Universal Agent or Agent Builder catalog files.

Recovery plan: Delete duplicate catalog files which are unused. This will avoid future problems with too many catalog/attribute files.

Advisory code: ATRHEALTH1007W

Text: Invalid Attribute run_name at line $ll in attribute file $onefn

Severity: 40

Check: For every attribute entry check for both attribute group name and attribute name

Meaning: This was spotted in one product-provided attribute file [kmc.atr]

Recovery plan: Probably nothing to worry about.

Don't Panic!!!

The first time you run the report you may see many, many advisories. Remember that the higher impact ones are the most important.

Most of the advisories will be related to leftover duplicates. Eliminating them will avoid future problems.

Rerun the report after making corrections. Then work through the Impact 100 advisories. You do not need to clear up every single issue immediately.

After correcting the hub TEMS, you will need to fix the catalog and attribute files on all the remote TEMS [and FTO backup hub TEMS].

Next Step: Use Portal Client

When you think this process is complete, use the Portal Client to evaluate all the catalogs in the TEMSes. That is easily viewed in a Portal Client session. From the Enterprise navigation node

1) right click on Enterprise navigation node
2) select Managed Tivoli Enterprise Management Systems
3) In the bottom left view, right-click on the workspace link [before the hub TEMS entry] and select Installed Catalogs
4) In the new display on right, right click in table, select Properties, click Return all rows and OK out
5) Resolve any missing or out of date application data. You can right-click Export... the data to a local CSV file for easier tracking.

It is not always required to make things perfect. For example if an agent connects to only some remote TEMSes, then only the hub TEMS and those remote TEMS need the catalogs. However cases where the dates are different definitely need correction. In general correction means installing the correct application support.

When you have made all those corrections, repeating the Attribute and Catalog survey one last time will increase confidence in the environment.

Summary

This report shows problems with Attribute and Catalog files. Fixing them will make the ITM environment work more reliably.

History and Earlier versions

If the current version of the Attribute and Catalog Health Survey tool does not work, you can try previously published binary object zip files. At the same time, please contact me to resolve the issues. If you discover an issue, try intermediate levels to isolate where the problem was introduced.

Sitworld: Best Practice TEMS Database Backup and Recovery

20 June 2016 - Version 1.2

Inspiration

I was working with a customer with a TEMS Database File problem. In this case some of the situations had been deleted. In other cases over the years the Database files were not accessible because the index file was inconsistent. These cases are very rare but the results can be disruptive. The hub TEMS or some remote TEMS cannot start or are running without all situations and other objects.

This document presents five best practice procedures for creating reliable database backups. It is not a reference for creating a full and complete backup including configuration and application support files. See here for that reference. This document is dated. Based on history there will be more changes in the future.

At version 1.2, five empty table zip file references were added. See Note 2 at the end.

Background for Distributed Linux/Unix/Windows platform

There are 50+ TEMS database tables and most of them are represented by indexed sequential files [QA1*.DB/IDX]. 16 of those tables contain user data such as situation definitions and distribution configuration. The IDX file links the keys of a table to the location of the related objects in the DB file. If there is an interruption in the update process, the IDX file may become inconsistent and the data unavailable.

Here are some cases which caused an interruption in the past.

TEMS crash

System shutdown without stopping TEMS [AIX system before ITM 623 FP3]

Mount point or disk full

Hardware failure where system or SAN lost power

Networking outage when writing to a NFS mount

There are certainly many more possible causes. These are just the cases I have seen over the years.

The TEMS environment variable KGLCB_FSYNC_ENABLED defaults to 1 and that decreases the chances of problems. Review the environment variables in <installdir>/logs/ms.env [or <installdir>\logs\ms.env on Windows], and if it is set to zero [0], you should change that setting.
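A quick way to check on Linux/Unix [assuming the default install directory]:

grep KGLCB_FSYNC_ENABLED /opt/IBM/ITM/logs/ms.env
==> the value should be 1; if it shows 0, change the setting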

These are very rare cases. When and if the problem ever hits, a recovery plan will ensure a prompt return to normal processing.

A Poor Backup Plan

While the hub TEMS is running, make a copy of the QA1* files in

Linux/Unix: <installdir>/tables/<temsname>

Windows: <installdir>\cms

That is better than nothing, but it might result in an inconsistent set of tables because the tables are constantly changing. With the on-the-fly captured files the TEMS might not even start up.

Solution 1 - No Secondary hub TEMS

The simplest and easiest backup plan is to stop the hub TEMS before copying the QA1* files into a compressed tar or zip file. That ensures capturing a consistent state.

If you do that once a week during a maintenance period, you can always restore those files and have a consistent state. There is certainly a cost in doing that but the cost for an outage is much higher.
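A minimal Linux/Unix sketch [the TEMS name, paths, and archive name are assumptions; itmcmd is in <installdir>/bin]:

cd /opt/IBM/ITM
bin/itmcmd server stop HUB_TEMS
tar -czf /backups/qa1_backup_$(date +%Y%m%d).tar.gz tables/HUB_TEMS/QA1*
bin/itmcmd server start HUB_TEMS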

Solution 1 – Recovery

Stop the hub TEMS.

Make a pdcollect to capture the current state

Restore the QA1* files from a backup

Start the hub TEMS

At this point all objects will be restored to the time of backup.

Prepare empty table files

The next solutions require a maintenance level specific copy of all the TEMS database files representing empty files. See Note 2 for emptytable file references.

Solution 2 – Hot Backup - Valid from ITM 622 FP5

In this configuration you have two hub TEMS but only one is ever used as the primary hub TEMS. The other hub TEMS is started for backup purposes only. This is not an actual FTO configuration but it uses FTO logic to get the job done.

The hot backup hub TEMS is configured with FTO pointed to the running hub TEMS. That has an additional required control in the KBBENV file which is in

Windows: <installdir>\cms

Linux/Unix: <installdir>/tables/<temsname>

Add this line manually

MHM:HOTBACKUP=1

When the hot backup hub TEMS starts, it works to make sure that its own synchronized database files match the other hub TEMS. The first hub TEMS and all the remote TEMS are totally unaware of this usage. When the synchronization is complete [See Note 1] stop the Hot Backup hub TEMS and archive the QA1* files.

Solution 2 – Recovery

When a problem is found with TEMS database files, a recovery action is required. This case does require some hub TEMS down time.

1) Stop the usual hub TEMS if still running.

2) Configure the usual hub TEMS with FTO with the partner being the Hot Backup hub TEMS. Add the MHM:HOTBACKUP=1 line manually to the usual hub TEMS KBBENV file.

3) Start the problem hub TEMS and wait for it to synchronize with the Hot Backup hub TEMS. [Note 1]

4) Stop the Hot Backup hub TEMS.

5) Stop the usual hub TEMS.

6) Configure the usual hub TEMS so it is not using FTO and remove the line MHM:HOTBACKUP=1.

7) Start the usual hub TEMS.

8) Configure the Hot Backup hub TEMS to use FTO and make sure the MHM:HOTBACKUP=1 line is present.

9) Verify normal operation.

Solution 3 – Fault Tolerant Option [FTO]

In this configuration you have two hub TEMS and one has the primary role and one has the backup role. Both hub TEMS tasks have equal user objects in the tables. Once a week or so stop the current backup hub TEMS and copy the QA1* files into a compressed tar or zip file.

Solution 3a – Recovery When one hub TEMS is OK

After a problem is found connected with the TEMS database files, a recovery action is required. Usually the problem is on the primary hub TEMS.

1) Stop the hub TEMS with the problem [if required] and the remote TEMS tasks will switch over to the backup hub TEMS which takes on the primary role.

Solution 4 - FTO and Hot Backup - Valid from ITM 622 FP5

In this configuration you have two hub TEMS and one has the primary role and one has the backup role. Both hub TEMS tasks have equal user objects in the tables. Create a third hub TEMS used only for backup purposes. The two FTO hub TEMS will be totally unaware of the backup process so normal operations are unaffected.

Use the process documented in Solution 2. The hub TEMS used only for backup is configured in the "Hot Backup" mode and is configured to point at the usual primary hub TEMS. Before the backup process, this third TEMS is started and the TEMS database files are synchronized. See Note 1 for determining when the synchronization is complete. This will normally complete in 10-20 minutes, and you can confirm by scanning the operations log file for the named messages. At that point stop the TEMS. The QA1* files are in a stable synchronized state and are sufficient to be used for a recovery.

If a recovery is needed and one hub TEMS is OK, Solution 3a recovery is sufficient.

If a recovery is needed and both the usual primary hub TEMS and the usual backup hub TEMS are damaged, use this Solution 4 backup with the Solution 3b recovery process.

Credits

Many kudos to Richard Bennett, IBM Support L3 TEMS team lead for his extensive knowledge and his wise editing suggestions.

Summary

This document presents best practice procedures for creating a reliable backup of the TEMS database files and for using those files in a recovery action.

History

1.2 - Added Emptytable file references.

1.1 - Added Solution 4

1.0 - Initial publication

Note 1

The TEMS operations log is located in

Windows: <installdir>\cms\kdsmain.msg

Linux/Unix: <installdir>/logs/hostname_ms_decimaltime.log

During a recovery like this there will be a long series of messages about individual objects being updated. Look for one of the following messages in the TEMS which is being recovered:

KQM0009 FTO promoted <temsname> as the acting HUB.

KQM0013 The <temsname> is now the acting HUB.

KQM0014 The <temsname> is now the standby HUB.

These messages occur when synchronization between the primary monitoring server and the secondary monitoring server has completed.
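To check for these without reading the whole log, a simple scan works [a sketch; the Linux/Unix log file name follows the pattern above and differs per system]:

grep -E "KQM0009|KQM0013|KQM0014" /opt/IBM/ITM/logs/*_ms_*.log

On Windows, findstr KQM0 <installdir>\cms\kdsmain.msg does the same job.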

Note 2: Emptytable Zip Files

The best practice backup/recovery processes require emptytable files. Following are references to 5 zip files for 5 ITM maintenance levels. Each zip contains three files.

Treat these emptytable files with great care. If some hub TEMS tables were replaced with empty ones, you would lose all of the custom content created manually over the years. That could mean an extensive outage and a lot of manual work to recover. On the other hand you can replace remote TEMS database files and the FTO backup hub TEMS files [when the TEMS is stopped] with full confidence.

The Linux/Unix files need some preparation and unpacking. They need to have the same attributes/owner/group as the files currently installed. That is accomplished with the chmod/chown/chgrp commands.
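For example [a sketch; the mode, owner, and group shown are placeholders - run ls -l first and match what the installed files actually use]:

cd /opt/IBM/ITM/tables/HUB_TEMS
ls -l QA1*.DB | head -1        # note the current mode, owner and group
chmod 660 QA1*.DB QA1*.IDX
chown itmuser QA1*.DB QA1*.IDX
chgrp itmusers QA1*.DB QA1*.IDX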

If you are using an emptytable for the first time or have any doubts about usage, involve IBM Support before any action.

Sitworld: Configuring a Stable SOAP Port

Inspiration

A customer reported that tacmd login failed. The tacmd functions were an important component of their ITM automated operations. As documented in the ITM Troubleshooting Manual here, I had them stop all ITM processes, start the hub TEMS, and then start the other ITM processes. The customer reported that all was then well, but they were very concerned about the interruption to normal operations.

Background

For the new solution skip directly to "A New Solution" below.

In the 1990s the Simple Object Access Protocol – SOAP for short – was developed. In the early 2000s the name became just SOAP, no longer an acronym. SOAP can be used to access data over a SOAP service. The tacmd login and many other tacmd functions use the SOAP service running on the TEMS. Many customers have created automation solutions which use tacmd functions or SOAP from Perl, Java, or other languages.

Every ITM process in the default configuration has an internal web server. The default listening ports are 1920 [HTTP] and 3661 [HTTPS]. The TEMS SOAP process is closely linked to the internal web server. If the TEMS is the only ITM process running on a system, the results are straightforward: SOAP is available if the TEMS is running and not available if the TEMS is not running. In the following, only port 1920 is referenced, but the same process works for port 3661 or any other non-default port if configured that way.

When multiple ITM processes are running the situation is more complicated. Each individual web server attempts to bind to port 1920. One wins and owns the port. The other web servers fail the bind, connect to the winner, and register their information. If the 1920 owner stops, the socket connections to the winner fail and all the processes again attempt to get the port. There will be one winner and every other web server connects to it. When the original 1920 owner starts up again, it tries for 1920, fails, and then connects to the winner.

You can see this environment by browsing to the service index page http://server:1920. The list of services you see comes from the 1920 owner and includes the registered data from the other web servers. If you rest the cursor on a link – “IBM Tivoli Monitoring Web Services” for example – you can see the URL in the status area. If that link has a port of 1920, that service is running on the 1920 owner. If not, it is running on one of the web servers that have registered.
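The same information can be gathered from a command line, which helps on systems without a browser [a sketch; myserver is a placeholder host name]:

curl -s http://myserver:1920/ | grep -i href

Each link shows the registered URL, including the port that service actually listens on.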

If there are no firewall rules involved, tacmd login works smoothly. The tacmd program reads an XML file equivalent to the service index page, calculates what the actual SOAP port is, and then uses it.

If there are firewall rules in place to limit which ports can be used, trouble arises. Let's say that only 1920 and 1918 are allowed ports. If the TEMS starts first, then SOAP is on port 1920 and all is well. If the TEMS is recycled, port 1920 migrates to another ITM process and SOAP lands on an ephemeral port. The tacmd program figures out that new port and tries to use it, but the firewall rules prevent that access.

This condition has existed since the beginning. See this topic in the Troubleshooting Guide.

A New Solution

Almost any solution is better than shutting down all ITM processes, including the hub TEMS, and then starting up again. Recently a good customer figured out a better solution: configure the ITM processes so that only the hub TEMS uses port 1920.

First make sure the TEPS is not running on the same system as the hub TEMS. TEPS also needs the internal web server in default mode.

The hub TEMS configuration is not changed.

Look at an agent diagnostic log. It will have a name of the form <hostname>_<pc>_<taskname>_<hextime>-01.log.

Linux/Unix Solution on ITM 623 and following

For every ITM process except the TEMS, create an environment file for a permanent override. If one already exists you can just use it. For example, if you ran a Unix OS Agent you would create ux.environment. The file must be in the <installdir>/config directory and must have the same attributes/owner/group as the ux.ini file. Add to that file this line

This is the ITM development design for permanent customer configuration changes. You will need one such file for each agent on the system running the hub TEMS.

Linux/Unix Solution on ITM 622 and earlier

For every ITM process except the TEMS, create an override file. For example, if you ran a Unix OS Agent you would create ux.override. The file must be in the <installdir>/config directory and must have the same attributes/owner/group as the ux.ini file. The contents of the file would be

The only difference from the ITM 623 solution is the single quotes. Also update the ini file, adding a source include line like this.

. /opt/IBM/ITM/config/ux.override

That is a period followed by a space followed by the fully qualified name of the override file. The path name would be different if your installation directory was different. You could use the same file for multiple agents.

Usage Notes and Variations

After the changes have been made and all ITM processes are recycled including TEMS, only the TEMS will be listening to port 1920. That will be the permanent SOAP port. If the hub TEMS is not running then the tacmd login will fail, but that is expected and reasonable.

For a FTO environment with two hub TEMS, make this change on both hub TEMSes.

Another alternative for the non-TEMS processes would be to set the following communications string:

http_server:n ip.pipe port:1918 ip use:n ip.spipe use:n sna use:n

which stops the internal web server from even starting.
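As a sketch of how that alternative could be made permanent on ITM 623 and later [an assumption on my part: the string is carried in the KDC_FAMILIES variable, the same variable involved in the case described next; verify the merged result in <installdir>/logs/ux.env]:

# hypothetical <installdir>/config/ux.environment entry for a Unix OS Agent
KDC_FAMILIES=http_server:n ip.pipe port:1918 ip use:n ip.spipe use:n sna use:n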

I saw a recent case where the root userid which ran the agent had a .profile which included

export KDE_TRANSPORT=HTTP:n

In this case KDE_TRANSPORT took precedence and all the KDC_FAMILIES changes were ignored. This was eventually spotted by reviewing the <installdir>/logs/ux.env file, which had both KDE_TRANSPORT and KDC_FAMILIES included.

Summary

This post shows how to configure a hub TEMS with a stable permanent port to access SOAP services. This change will increase the availability of SOAP services.

Sitworld: CPAN Library for Perl Projects

Draft #4 – 5 January 2015 - Level 2.00000

Introduction

The Sitworld blog posts include a number of Perl projects. Some of them use Comprehensive Perl Archive Network or CPAN packages.

If you are implementing a project on your own workstation adding the packages is a best practice.

However if you are implementing on a shared production system, that might be difficult or impossible. If you do not have internet access or are less experienced with Perl, the challenges can be severe. To manage this issue, this post has a zip file which contains a directory with the needed CPAN packages. See the History section following the summary if you need earlier levels. These are only a few of the almost 25,000 available packages.

A Perl invocation parameter causes that directory to be searched before the system libraries. For most Sitworld projects that will run ok without changing the system installed Perl libraries.
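For example [a sketch; cpanlib and project.pl are placeholders for the unzipped directory and whichever Sitworld project you are running]:

perl -Icpanlib project.pl

The -I parameter prepends cpanlib to the Perl @INC search path, so packages there are found before the system-installed ones.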

There are two libraries. Version 1.00000 is for Perl 5.16 and version 2.00000 is for Perl 5.20. The main difference at 5.20 is the absence of SOAP::Lite. The Perl 5.20 enabled packages now use LWP::UserAgent instead. That is needed for some future multi-threading requirements.

Limitation

This pre-system library does not work if a CPAN package contains C source code. In that case it must be compiled and such a package cannot be delivered this way. Right now the limitation applies only to the Situation Audit project.

Sitworld: Detecting and Recovering from High Agent CPU Usage

Revised 27 July 2013

Inspiration

In a customer environment of 10,000+ Unix/Linux systems, the OS agents were observed running at high CPU occasionally. The particular issue was diagnosed as APAR IV18016 – a problem where some rare types of events created during autonomous mode [not connected to a TEMS] cause high CPU when the agent eventually connects to the TEMS. The customer was at ITM 623 FP1 and the problem was resolved at ITM 623 FP2. An upgrade was some time away and the customer needed a workaround much sooner.

Solution overview

The obvious solution was to use a situation based on the Process.CPU_Pct [Unix] or KLZ_Process.Busy_CPU_Pct [Linux] value to trigger a recovery action. That did not work because until ITM 6.3 the process CPU busy per cent value always showed as zero for the klzagent and kuxagent processes. A more basic algorithm was required. These are tested example situations and action commands, but in any real world environment changes will be needed: test and tailor them for your specific environment. They are a starting point for something useful.

First part – an always true situation that uses a detector action command to calculate the OS Agent CPU percent. The action command uses Linux/Unix commands present on most systems. The action command will create or erase a marker file to represent a high CPU at OS Agent condition based on a threshold.

Second part – a situation to detect the marker file and trigger an OS Agent recycle. It also deletes files to complete the recovery action.

The action commands are 95% of the interesting work and so they are presented first.

Action Command – Detector for Unix OS Agent

The design is relatively simple:

capture start CPU time of Unix OS Agent

wait 180 seconds

capture end CPU time of Unix OS Agent

calculate CPU per cent over 180 seconds

create marker file if over threshold else delete marker file
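Expressed as a standalone script - an option discussed below - the design looks like this sketch. The 180 second sample, 25% threshold, and agt_cpu.high marker file mirror the example action command; everything else is an illustrative assumption:

#!/bin/sh
# sketch of the detector logic as a standalone script
cd "$CANDLEHOME/logs" || exit 1
# convert the last 8 characters of a ps cputime value [hh:mm:ss] to seconds
tosec() { echo "$1" | awk '{t=substr($0,length($0)-7,8); split(t,a,":"); print a[1]*3600+a[2]*60+a[3]}'; }
s=$(ps -e -o cputime,args | grep -v grep | grep kuxagent | head -1 | awk '{print $1}')
sleep 180
e=$(ps -e -o cputime,args | grep -v grep | grep kuxagent | head -1 | awk '{print $1}')
# per cent = CPU seconds * 100 / 180 elapsed seconds [the "divide by 1.8"]
pct=$(( ($(tosec "$e") - $(tosec "$s")) * 100 / 180 ))
if [ "$pct" -ge 25 ]; then
  touch agt_cpu.high      # marker file - the recycler situation reacts to this
else
  rm -f agt_cpu.high
fi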

There are some compromises because the length is near the size limit of an action command. This could be avoided by working the logic into a script that runs on each system, as sketched above. The pure situation approach is simpler and can be turned on and off conveniently. Here is the example command followed by a full explanation.

cd $CANDLEHOME/logs;
==> make the logs directory be current directory where the marker file will be stored

(
==> Open sub shell level 1

(echo `ps -e -o cputime,args | grep -v grep |grep kuxagent`);
==> Within a sub shell, capture the current CPU time for the OS Agent. The values captured are the cputime in day-hh:mm:ss form and the args. The grep -v eliminates the line associated with the grep itself.

) |
==> After the sub shell finishes there is a standard output file with two lines and that is fed to the standard input of the next stage.

awk
==> This runs an inline program to do calculations.

'{
==> The inline awk program is delimited with single quotes. The unlabeled function reads a file - in this case the two standard input lines.

ss=substr($1,length($1)-7,8);
==> ss contains the last 8 characters of the first blank delimited word. This removes the day value in order to fit within the action command length limitations. When the CPU time crosses a day boundary, that creates a negative value later on, so one sample could be wrong. However the conditions being detected are usually persistent, so the result is only a short delay in detection.

getline;
==> Read the next line of input.

ee=substr($1,length($1)-7,8);
==> ee contains the last 8 characters of the first blank delimited word.

(
==> Open sub shell. The command line will be run as a separate process from the agent because of the trailing ampersand at the end. If the command ran in the agent process, it would stop after the agent stopped.

sleep 30;
==> Wait 30 seconds. This allows time for the event from the recycler situation to be sent to the TEMS.

rm -f $CANDLEHOME/logs/agt_cpu.high;
==> Delete the marker file. Without this the recycler situation would constantly recycle the OS Agent.

rm -f `find $CANDLEHOME | grep psit_.*_KUX.str`;
==> Delete the persistent situation file. The back quotes run the embedded command first and so this dodges the issue of what architecture directory the file is present in.

$CANDLEHOME/bin/itmcmd agent start ux
==> Start the Unix OS Agent

) &
==> End of subshell and the trailing ampersand to cause command line to run in the background.

Action command testing

Never [!!!] just put an action command into a situation and expect it to work. Test it thoroughly on an actual command line first. Test it in segments and echo intermediate results to the screen or a log file. When you are doing situation testing, view the agent operations log - the *.LG0 file. As each command runs you will see the initial characters. The status code should be zero and not anything else. Finally, observe to make sure the action command is doing the right thing.
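For example, test the capture stage on its own first [a sketch; the exact ps output format varies by platform]:

ps -e -o cputime,args | grep -v grep | grep kuxagent

If that prints the expected single line with the agent CPU time, feed the same output to the awk stage, and only then assemble the full command.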

With a preexisting action command like this, review it carefully and see what changes might be needed. For example, you could choose a longer or shorter sample interval [the sleep 180 and the later divide by 1.8]. You might choose a different threshold instead of 25. You might choose to store the marker file in a different location. Test thoroughly and never assume anything is correct until tested.

Situations to drive the Detector/Recycler commands

The action commands compose 95% of the interesting ideas. The example situations are available here. I will give here the example situation names and some comments.

Usage Notes

When first used, about 10% of the OS Agents were recycled once and then not again. This was viewed as an excellent outcome.

A small number of agents were constantly recycling. When examined closely the systems were in serious trouble. For example one Linux system which normally had 250-300 processes was running with 6600 processes. There was a Solaris system where the Unix OS Agent was experiencing unknown problems and IBM Support had to get involved. This identification of problem systems was also viewed as an excellent outcome.

Some weeks after these situations were adapted and implemented, the system administrators of the AIX and Linux systems noticed a significant improvement and expressed pleasure at the changes.

The detector situation events were not forwarded to any event receivers and were not associated with any Portal Client navigator nodes when running in production. That is typical of helper situations. The recycler situation events were displayed as usual.

Performance Notes

The detector situations will cause open events whenever started. With a 15 minute sampling interval this is somewhat lower impact than the node status updates. There will be a TEMS process size increase from the open events. In this case the balance clearly tilts in favor, since the cost of a high CPU OS Agent is severe. The workaround is temporary of course, since the particular APAR fix will eventually resolve the issue.

Once things are stabilized, the detector/recycler commands might be run once a day or a week to look for new issues.

History

27 July 2013 - added example situations with action commands for

1) Solaris Global Zone

2) Solaris Local Zone and Solaris [same example]

3) HP/UX

4) AIX LPAR Global with WPARs

Note that Unix OS Agents are not supported for data collection on WPARs. They are supported for remote deploy operations only. Because of that support limitation there is no AIX/LPAR/WPAR example.

The related action commands use several added techniques to keep the size within situation action command limits. For example an awk function was used for cpu time to seconds conversion. Each environment needed slightly different command lines to achieve the desired goal but all follow the original pattern. The recycler situation commands were identical in form.

Summary

This post shows a general design to detect a specific issue on a system with an ITM OS Agent and then execute a recovery action in a second situation. It is especially convenient since no external scripts need to be added to the environment. I have used similar logic in the past to get many agents recycled with little manual effort to implement a configuration change.

Sitworld: Detecting and Recovering from High Windows OS Agent CPU Usage

Inspiration

Several months ago in this post, I documented sample situations and action commands which detected high CPU condition on Linux/Unix OS Agents and initiated a recycle of that agent. This is the Windows version of that same function.

The Linux/Unix detector/recycler solution was more complex because of an Agent issue that was corrected in ITM 6.3. Before that maintenance level the attributes which calculate process CPU time always returned zero for the main agent CPU utilization. The parallel attribute in Windows always worked as expected and so a single situation suffices. For completeness a double solution is documented later.

cd %CANDLE_HOME%\InstallITM &
==> Set current directory to where KinConfg is located. In Windows OS Agent action commands the installation directory is set in the CANDLE_HOME environment variable.

KinConfg -n -pKNT -Lnul &
==> Stop the Windows OS Agent. Using the -L flag the log is written to the nul file and discarded. The log option is needed so KinConfg will not complete until the Windows OS Agent has actually stopped.

( for /f %F in ('DIR /B /S %CANDLE_HOME%\^| findstr /r /c:.*NT_agntstat.sta /c:.*NT_thresholds.xml /c:psit.*_NT.str') do del /q %F ) &
==> Search all files in install directory for certain files that should be erased to avoid certain high CPU conditions. This is the payload which can relieve the high CPU in some cases. You could use it for any other purpose.

KinConfg -n -sKNT -Lnul
==> Start the Windows OS Agent. Using the -L flag the log is written to the nul file and discarded. This is needed so KinConfg will not complete until the Windows OS Agent has actually started.

Double Situation Solution

For completeness, this is parallel to the original scheme – a Detector situation that does the detection without using Windows OS Agent Attribute values and a Recycler situation that performs the recycle based on a marker file. It probably isn't needed for this purpose but it is an interesting model for this sort of work.

cd %CANDLE_HOME%\logs &
==> Set current directory for where the marker file will be created. In Windows OS Agent Action commands the installation directory is set in the CANDLE_HOME environment variable.

start /w
==> start a new process, wait until it completes and then exit. All following commands operate under this control

wmic /output:wmic.out path Win32_PerfFormattedData_PerfProc_Process where (Name = 'kntcma' and PercentProcessorTime ^>= 15) get Name /format:list &
==> a command that prints the Windows OS Agent name if it is using more than 15% CPU. The ^ above is an escape character so the “>” will not be treated as a redirection control.

( type wmic.out | findstr /c:kntcma &&
==> type the output and use findstr to see if the Windows OS Agent line is present. The "type" command is required because wmic.out is in UTF format. The following && means the next command is run only if the exit code is zero.

copy /y nul agt_cpu.high ||
==> This command is run only if the exit code from the prior stage is zero. Create the zero length marker file to trigger a recycle.

del /q agt_cpu.high )

==> This command is run only if the exit code from the prior stage is non-zero. Delete the marker file since the agent is using less than 15% CPU.

cd %CANDLE_HOME%\InstallITM &
==> Set current directory to where KinConfg is located. In Windows OS Agent action commands the installation directory is set in the CANDLE_HOME environment variable.

KinConfg -n -pKNT -Lnul &
==> Stop the Windows OS Agent. Using the -L flag the log is written to the nul file and discarded. This is needed so KinConfg will not complete until the Windows OS Agent has actually stopped.

( for /f %F in ('DIR /B /S %CANDLE_HOME%\^| findstr /r /c:.*NT_agntstat.sta /c:.*NT_thresholds.xml /c:psit.*_NT.str') do del /q %F ) &
==> Search all files in install directory for certain files that should be erased to avoid certain high CPU conditions. This is the payload which can relieve the high CPU in some cases.

KinConfg -n -sKNT -Lnul
==> Start the Windows OS Agent. Using the -L flag the log is written to the nul file and discarded. This is needed so KinConfg will not complete until the Windows OS Agent has actually started.

Success Story

A customer implemented this solution and thereby alleviated a small number of cases where Windows OS Agents were using excessive CPU time. Recycling them restored normal function in most cases. A very small number of cases involved Windows systems that needed careful investigation and work.

Credits

Two Citibank people did important work on completing and testing this solution.