Sitworld: Sampled Situations and Until Situations

John Alvord, IBM Corporation
jalvord@us.ibm.com

Inspiration

In 2012 I worked for months on a case where a sampled situation with Until/Situation did not work reliably. In-depth tracing showed exactly what was happening. See below for details.

That deficiency has bothered me ever since. UNTIL is such a powerful concept and it was hard to have to abandon the idea.

This writeup also gives a chance to explain more about how ITM Situations really work - a submarine deep dive. This discussion only applies to sampled situations.

Special Note!!

The Until functionality stopped working at ITM 623 FP4 and ITM 630 FP2. There is an assigned APAR

IV52123 - SITUATION W/ UNTIL 'SIT' CLAUSE WILL NOT OPEN.

It will be fixed in the next planned maintenance release.

How a Situation without UNTIL works

A situation is transmitted to an agent. The agent evaluates the formula immediately and then again each sampling interval.

When the formula is false [no rows] a zero-row result is sent to the TEMS, and then nothing more is transmitted as long as the formula stays false.

When the formula is true [some rows] the rows are sent to the TEMS each evaluation.
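
The agent-side behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not ITM code: the function name agent_transmissions and the list-of-rows representation are assumptions made for the example.

```python
def agent_transmissions(evaluations):
    """Model what an agent sends to the TEMS over a series of evaluations.

    Each element of `evaluations` is the list of rows that passed the
    formula at one sampling interval (an empty list means the formula
    was false). Returns one entry per evaluation: the rows sent, an
    empty list for a zero-row result, or None when nothing is sent.
    """
    sent = []
    zero_row_sent = False  # has the current "false" stretch been reported yet?
    for rows in evaluations:
        if rows:
            # Formula true: rows are sent on every evaluation.
            sent.append(list(rows))
            zero_row_sent = False
        elif not zero_row_sent:
            # First false evaluation: send a single zero-row result.
            sent.append([])
            zero_row_sent = True
        else:
            # Still false: stay silent.
            sent.append(None)
    return sent


if __name__ == "__main__":
    print(agent_transmissions([[], [], ["r1"], ["r1", "r2"], [], []]))
```

The point of the model is that a quiet agent is normal: once the zero-row result has been sent, silence means "still false".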

The TEMS collects the result rows in situation specific buckets.

The TEMS data server [SQL processor] is deeply involved in the process. It tracks every situation. As each sampling interval expires, the data server processes results from the situations which need evaluation. It processes each of the pending situations in some order. This is where many cell functions like *COUNT are calculated. The result of that data server logic is saved as potential situation events.

Next the SITMON process reviews the potential events. It makes heavy use of the Situation Status Cache table, for example to track whether the situation event has already been created. This is also where PERSIST is measured, for example if the author wanted 5 true results in a row before a situation event was created.
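
As a rough illustration of the PERSIST idea, the sketch below counts consecutive true results and reports an event only once the count reaches the persist value. It is a simplified model under my own assumptions (the function name events_with_persist is hypothetical, and the real SITMON logic works through the Situation Status Cache table rather than a simple counter).

```python
def events_with_persist(results, persist=1):
    """Return, per sampling interval, whether a situation event is open.

    `results` is a sequence of booleans: True when the situation had
    result rows at that interval. An event is considered open only
    after `persist` true results in a row.
    """
    consecutive = 0
    states = []
    for is_true in results:
        # Any false result resets the run of consecutive true results.
        consecutive = consecutive + 1 if is_true else 0
        states.append(consecutive >= persist)
    return states


if __name__ == "__main__":
    samples = [True, True, False, True, True, True]
    print(events_with_persist(samples, persist=2))
```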

Side note: Open situations keep the TEMS working and limiting the number saves TEMS work. That can be done by changing the threshold values so only truly exceptional cases are alerted on. Increasing the sampling interval also reduces TEMS work.

How a Situation with UNTIL works

A Base situation can have a paired Until situation, configured in the UNTIL tab of the situation editor. Each situation sends results to the TEMS and each goes through the same logic. During SITMON processing, if the paired Until situation has results, the Base situation potential event is ignored. In fact, if there is an open Base situation event, it is closed.

That is very powerful logic. You can subtract one result from another. The Until situation can use a different attribute group – like LocalTime for example. And when it is present it does not limit the Base situation.

If the Base/Until are from the same attribute group and if DisplayItem is set the same, results will be paired up in the natural way.
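
One way to picture the suppression is as a subtraction of result sets. The sketch below is a toy model, not the SITMON implementation: the function apply_until and the dictionary rows are hypothetical, and it assumes that "paired up in the natural way" means a Base row is suppressed when an Until row carries the same DisplayItem value.

```python
def apply_until(base_rows, until_rows, display_item=None):
    """Return the Base rows that survive Until suppression.

    Rows are dicts of attribute name -> value. With no display_item,
    any Until result suppresses the whole Base situation. With a
    display_item, only Base rows whose display_item value also
    appears in the Until results are suppressed.
    """
    if display_item is None:
        return [] if until_rows else list(base_rows)
    suppressed = {row[display_item] for row in until_rows}
    return [row for row in base_rows if row[display_item] not in suppressed]


if __name__ == "__main__":
    base = [{"disk": "C:"}, {"disk": "D:"}]
    until = [{"disk": "D:"}]
    print(apply_until(base, until, "disk"))  # only the C: row survives
    print(apply_until(base, until))          # everything is suppressed
```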

The Problem with UNTIL situations

ITM processing does not occur instantaneously.

It takes time to collect the data, filter it, transmit to the TEMS, process initially, process through the data server and then process through SITMON. When TEMS is running hot, the Until results may not be ready for SITMON processing in time. As a result you can experience a false situation event that *should* have been suppressed by the Until situation results. The same thing can occur from communication delays, agent workload, competing non-ITM workloads on the agent or TEMS side.

This false event case doesn't happen often but has been seen in the real world.

Side note: The Situation Status Cache table can have a dramatic effect on SITMON processing times. This often arises when you have pure or sampled events using DisplayItems [first 128 bytes of some other attribute]. When the DisplayItem constantly varies, the in-storage Situation Status Cache table grows and grows. In the carefully studied case, a 400 megabyte cache table required 8 elapsed seconds to update a single existing row. Eventually the TEMS fails, so it is a really bad condition. You can always tell by checking the size of the QA1CSTSC.DB table file. If it is over 100 megabytes and still growing, you have that problem.
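
A quick size check on the table file can be scripted. The sketch below is a minimal example, assuming you supply the full path to QA1CSTSC.DB yourself (its location depends on the TEMS installation and is not given here); the function names are made up for the illustration, and the 100 megabyte default simply mirrors the rule of thumb above.

```python
import os


def cache_table_size_mb(path):
    """Return the size of the file at `path` in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


def cache_table_looks_oversized(path, limit_mb=100):
    """True when the file is larger than `limit_mb` megabytes."""
    return cache_table_size_mb(path) > limit_mb


if __name__ == "__main__":
    import sys

    # Usage: python check_stsc.py /path/to/QA1CSTSC.DB
    if len(sys.argv) == 2:
        table = sys.argv[1]
        print(f"{table}: {cache_table_size_mb(table):.1f} MB")
        if cache_table_looks_oversized(table):
            print("Over 100 MB: check again later to see whether it is growing.")
```

A single reading only shows the current size. Running the check again some hours later shows whether the table is still growing.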

In 2012 that was as far as we got. We explained why the DisplayItem could not be used in this case. That one change dramatically reduced SITMON processing time and the Base/Until situation false event rate dropped to zero.

An UNTIL Situation Solution

About a year later Mike Stevens [top gun ITM TEMS development engineer] mentioned that setting Persist=2 on the base situation would resolve the issue. That gives the Until situation two evaluation cycles to get the results in place to suppress the Base situation. That dodges all but the most extreme workload and environmental problems.
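
The effect of Persist=2 can be seen in a toy timeline. The sketch below is my own simplified model, not the actual SITMON logic: the function open_events is hypothetical, and it assumes that an interval where the Until results are visible resets the Base situation's run of true results. In the example, the Until results arrive one evaluation cycle late.

```python
def open_events(base_true, until_visible, persist=1):
    """Return, per interval, whether a Base situation event is open.

    base_true[i]     - the Base situation has results at interval i
    until_visible[i] - the Until results have reached SITMON by interval i
    persist          - consecutive unsuppressed true results required
    """
    consecutive = 0
    opened = []
    for base, until in zip(base_true, until_visible):
        if base and not until:
            consecutive += 1   # true and not (yet) suppressed
        else:
            consecutive = 0    # false, or suppressed by the Until results
        opened.append(consecutive >= persist)
    return opened


if __name__ == "__main__":
    base = [True, True, True]
    until = [False, True, True]   # Until results show up one cycle late
    print(open_events(base, until, persist=1))  # false event at interval 0
    print(open_events(base, until, persist=2))  # no event at all
```

With persist=1 the late Until results let one false event through. With persist=2 the Base situation has to stay unsuppressed for two cycles, so a one-cycle delay no longer matters.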

The original problem continues with Pure Situations using UNTIL, but that is never a good idea and certainly has the same sort of problems as outlined above.

Summary

This document describes a way to make Base/Until situations more resistant to creating false events: set Persist=2 on the Base situation.