<p dir="ltr"><strong>Getting more out of Netcool OMNIbus</strong>: configurations, tweaks and new ways of looking at things, all aimed at extending OMNIbus' use space.</p>
<p dir="ltr"><strong>Event Grouping by Alarm Scope - Part Five - setting the time window</strong> (Wim Harthoorn, 2015-09-18)</p>
<p dir="ltr">In this blog we come to the final piece of setting up Event Grouping, namely setting the time window. The time window is important because the principle on which event grouping is based is that events that occur in the same place at the same time are likely to have the same underlying cause. If no consideration were given to time, the event grouping might easily group alarms that have different causes, and that is also the risk if the time window is set too long. On the other hand, if the time window is set too short the event grouping may fail to catch those alarms that take a little while to generate, such as performance thresholds being breached. Like ScopeID then, setting the time window requires a bit of domain knowledge.</p>
<p dir="ltr">This diagram shows how we initially saw time windows working:</p>
<p dir="ltr"><a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/timewindow.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/timewindow.png" style="display: block; margin: 0px auto; text-align: center; width: 800px; height: 550px;"></img></a></p>
<p dir="ltr">&nbsp;</p>
<p dir="ltr">The actual implementation is slightly different, but it still follows the principle of the first alarm setting a time window which following alarms can extend if necessary. As the container closes when the time window expires without further extension, we call the OMNIbus field that holds the time window length QuietPeriod, the concept being that the time window closes after a quiet period of that length.</p>
<p dir="ltr">The default QuietPeriod is held as a property in the master.properties table and is set to 900 seconds at install. This may be too long on busy systems but can easily be edited. This default value is used when the QuietPeriod field in an event is set to the default of zero.</p>
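<p dir="ltr">Changing that default is a one-line edit in ObjectServer SQL. The sketch below is illustrative only: the property name and column names are assumptions, so inspect master.properties on your own ObjectServer to confirm the exact row before editing.</p>

```sql
-- Hedged sketch: shorten the default quiet period from 900 to 300 seconds.
-- 'DefaultQuietPeriod' and the column names are assumptions; check the
-- actual rows in master.properties on your ObjectServer first.
update master.properties set IntValue = 300 where Name = 'DefaultQuietPeriod';
go
```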
<p dir="ltr">QuietPeriod can also be set on an alarm by alarm basis, and for that we need to consider how alarms are generated and published. The mechanisms can be quite different. Some alarms are generated through instrumentation detecting a change in state and are then published immediately and unsolicited. The time between cause occurrence and reception in an event management application is short enough to be considered near real time. Most are probably not as fast as the reporting standards required for IEC 61850 compliance in electrical substations, which requires an alarm to reach the target system within 4 milliseconds of the condition arising, but then few IT systems could cause things to literally melt as a substation short circuit can. However, there are a large number of alarm types where arrival in OMNIbus is within seconds of the condition causing the alarm arising. Not all are, though, and the causes of delay vary:</p>
<ul dir="ltr">
<li>alarms that are not solicited but need to be retrieved by polling the event table inside a device will be delayed by the length of the polling cycle, which is typically one minute. Most OMNIbus Telco Service Monitors (TSM) included a polling application that ran every minute</li>
<li>alarms may be delayed because a system goes through an automatic retry or reset process before reporting an alarm; this too may add up to a minute&#39;s delay before the alarm appears in OMNIbus</li>
<li>alarms that are generated by testing a sample of data, for example bit error rate test (BERT) alarms, will be delayed by the length of the sampling period</li>
<li>alarms that are created by external performance monitoring systems reporting that a counter, or a delta between counter values, has exceeded a threshold will be delayed by the delta period. This delay can be significant, with polling cycles of 15 or 30 minutes, or even an hour, being common</li>
</ul>
<p dir="ltr">QuietPeriod needs to be set so that a container does not miss these delayed alarms, though with the last case a different approach may ultimately be needed.</p>
<p dir="ltr">The other consideration is how long it might take for the impact of an alarm to be felt. Datalinks may fail, but if there is redundancy the impact may only be congestion detected an hour later. Another common delayed impact is when a server or equipment rack switches to battery power when the mains power fails. Only when the batteries drain an hour or more later will any impact be detected. However, given that in such a case the mains failure is almost certainly the underlying cause of the incident, we want the battery backup alarm inside the event container and not on its own in a different one.</p>
<p dir="ltr">A final consideration is the likelihood that a particular alarm is reporting a condition that will trigger other alarms. If the likelihood is high then the QuietPeriod should be set long enough to catch these symptom alarms, but an alarm that is clearly a symptom should not extend the time window. Nor should a Resolution event: any event where Type = 2 should have QuietPeriod = 1 (0 triggers the default of 900, remember).</p>
<p dir="ltr">Each environment will be different, but here I suggest this rule of thumb:</p>
<ul dir="ltr">
<li>Likely cause alarms have QuietPeriod = 120</li>
<li>Possible cause alarms have QuietPeriod = 60</li>
<li>Symptom alarms have QuietPeriod = 1</li>
<li>Resolution events have QuietPeriod = 1</li>
<li>Environmental alarms where the impact is likely to be delayed (e.g. power fail, fan failure) have QuietPeriod = 900 or more</li>
</ul>
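<p dir="ltr">In a probe rules file, the rule of thumb above might be sketched like this. This is a hypothetical sketch, not the fix pack's own logic: the $AlarmClass token and the class names in the switch are assumptions standing in for however you classify your vendor's alarm codes.</p>

```
# Hypothetical sketch mapping alarm classes to QuietPeriod values.
# $AlarmClass is an assumed token; derive it from your vendor alarm codes.
switch ($AlarmClass)
{
    case "LikelyCause":
        @QuietPeriod = 120
    case "PossibleCause":
        @QuietPeriod = 60
    case "Symptom":
        @QuietPeriod = 1
    case "PowerFail" | "FanFail":
        # environmental alarms whose impact is likely to be delayed
        @QuietPeriod = 900
    default:
        @QuietPeriod = 0    # zero falls back to the master.properties default
}

# Resolution events must never hold a container open:
# 1 second, not 0, because 0 triggers the 900-second default.
if (int(@Type) == 2)
{
    @QuietPeriod = 1
}
```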
<p dir="ltr">In the next blog I will describe how this EventGrouping has been implemented in the OMNIbus FixPacks.</p>
<p dir="ltr">&nbsp;</p>In this blog we come to the final piece of setting up Event Grouping, namely setting the time window. The time window is important because the principle on which event grouping is based is that events that occur in the same place at the same time are likely to...01832urn:lsid:ibm.com:blogs:entries-9c12c00f-1498-4573-8fc7-cfd022b8c15aGetting more out of Netcool OMNIbus2016-06-16T08:31:02-04:00urn:lsid:ibm.com:blogs:entry-47a4b60c-7ce7-4fdc-b8df-14a80a9b9742Event Grouping by Alarm Scope - Part Four - determining ScopeID7YA6_Wim_Harthoorn0600027YA6activefalse7YA6_Wim_Harthoorn0600027YA6activefalseComment Entriesapplication/atom+xml;type=entryLikestrue2015-09-11T12:41:27-04:002015-09-11T12:41:27-04:00<p dir="ltr">In my previous <a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/Event_Grouping_by_Alarm_Scope_Part_Three_the_implementation?lang=en">blog</a> I described how generic event grouping has been implemented in OMNIbus. OMNIbus v8.1 Fix Pack 4 has now been released so what I described is now generally available. In this blog I want to expand on Scope ID and the considerations needed to populate that field in OMNIbus events.</p>
<p dir="ltr">The ScopeID field has to be populated, as the automations pass over events with a blank ScopeID. However, ScopeID does not have to be populated on insert, so post-insert event enrichment via Netcool Impact can be implemented. While the intention was to make this event grouping an out of the box feature, there is no single right choice for ScopeID: it will vary from technology to technology, from vendor to vendor and from customer to customer. What I propose to do here is set out some guidelines.</p>
<p dir="ltr">Some networks are nicely hierarchical, and here it is relatively easy to select a ScopeID. A GSM or 3G radio access network, for example, has cell sites logically grouped around a base station controller, so the base station controller is a good candidate for ScopeID. If the network operator has adopted the 3GPP standard naming convention it is also easy to extract the BSC name from the node name of the alarming device. This gives some quite logical groupings:</p>
<p dir="ltr"><br>
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/nms2000sampless.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/nms2000sampless.png" style=" width:100%; display:block; margin: 0 auto;text-align: center;"></img></a></p>
<p dir="ltr">Here the twisties are opened, but initially all this would be listed under a single line.</p>
<p dir="ltr">Now we don&#39;t know whether the two sites are connected but there is a strong possibility that they are. Neither can we be certain that a PCM failure has caused those BERT alarms, but again there is a strong possibility that it did. The possibilities are certainly strong enough for a decision to cut just one ticket is the sensible one.</p>
<p dir="ltr">&nbsp;</p>
<p dir="ltr">Networks that are not so hierarchical are a different matter though. To select a ScopeID we need to remember the principles: how far can the impact of an alarm be felt, and can we have any symptoms or knock-on alarms grouped under the same ScopeID? It&#39;s also worth considering where the common root cause of a collection of alarms might lie.</p>
<p dir="ltr">If the main source of our alarms is customer premises equipment in domestic properties, such as cable modems or smart meters, then a postal or zip code lifted from the inventory will work well. Power and communications cables tend to go along streets, and postal codes are allocated to streets, so a break in a cable will be felt along a street rather than more diffusely over a district.</p>
<p dir="ltr">Enterprises might choose to scope alarms by the user service, for example grouping together the servers that deliver a service. That will allow a service manager to see whether a problem is common to a number of servers. Another approach, if the information is available, is to group by alarm class. Remember the grouping also involves a time window. A Communications group alarm with four site members may point at an underlying fiber problem, an Equipment group alarm may contain both the servers affected and the PSU that has failed, a Configuration group alarm may show the sites affected by a botched change.</p>
<p dir="ltr">Another possibility is to treat a large node as the ScopeID and sub-divide it by alarm class. The sub-division may use the SiteName field, but there is no reason why SiteName<strong> has</strong> to be a location. For large routers and similar devices that are very chatty this provides a measure of summarising that can be very useful.</p>
<p dir="ltr"><a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/5620SS.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/5620SS.png" style=" width:100%; display:block; margin: 0 auto;text-align: center;"></img></a></p>
<p dir="ltr">Here the summary of the lower level containers reports the normalised alarm description of the highest cause code alarm.</p>
<p dir="ltr">Setting ScopeID requires a knowledge of the domain being managed, as does setting the length of the time window, which will be the topic next time.</p>
<p dir="ltr"><strong>Event Grouping by Alarm Scope - Part Three - the implementation</strong> (Wim Harthoorn, 2015-09-01)</p>
<p dir="ltr">In my last <a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/Event_Grouping_by_Alarm_Scope_Part_Two_the_theory?lang=en">blog</a>, describing a new approach to event grouping based on the assumption that alarms that occur at the same time in the same place probably have the same underlying cause, I ended by saying that event grouping by alarm scope needed to define three things:</p>
<ul dir="ltr">
<li>define what same time meant</li>
<li>define what same place meant</li>
<li>select which of the group of alarms sharing the above two attributes is the most likely cause.</li>
</ul>
<p dir="ltr">In this blog I will go into more detail on how we implemented this in OMNIbus fix packs. However, before I do that, here is a&nbsp;<a href="https://www.ibm.com/developerworks/community/blogs/roller-ui/authoring/uploadFiles.do?weblog=More_from_OMNIbus&amp;path=Event%20Grouping%20Files&amp;lang=en">link</a> to the ObjectServer SQL file included in Fix Pack 4.</p>
<p dir="ltr">I will deal with &quot;same place&quot; first. As I said in an earlier blog, we started this work in response to a request from a major Asian telco. Most of their use cases revolved around correlating infrastructure and environmental alarms - power and air conditioning - to networking problems. One example: if a cell site alarms that it is switching to battery operation because of a mains power failure, and an hour later the cell site goes off air, those alarms should be linked. It is, after all, a very reasonable assumption that the backup batteries were drained after an hour and that this is the reason for the cell site being down. Similarly it was expected that a fan failure would see equipment cabinet temperatures rising, and that as a result communications links might start clocking up framing errors or bit error rate test threshold alarms. So the first step was to populate alarms with the node location&#39;s site name. Ideally SiteName would be a unique identifier and would be included among the tokens sent by the element management system, as indeed it is in most cases. Failing that, a lookup statement or an Impact policy could enrich the alarm from some inventory file.</p>
<p dir="ltr">Grouping by site name does not however bring in alarms from other sites that are related. One site might have a cabinet power failure and as a result there is a communications link failure. The site at the other end of that link will also generate alarms, for example Loss of Signal or Loss of Frame (or both) but as this is a different site those alarms won&#39;t be in the event grouping. This is where the concept of scope comes in. We define alarm scope as being the extent to which the impact of an alarm can be felt. Thus, as a comms link failure can be detected at the remote end of the link, the scope of that alarm should cover both A and B end sites. It should also cover the link itself, because that link might be based on transmission equipment that is supporting other links that are also in alarm. This means the choice for scope ID might be wider than just two sites.</p>
<p dir="ltr">Two new fields have been created in OMNIbus, @SiteName and @ScopeID. The diagram below represents how a GSM/3G network RAN might be presented in this way.</p>
<p dir="ltr" style="text-align: center;"><a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/ScopeID.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/ScopeID.png" style="width: 600px; display: block; margin: 0px auto; text-align: center;"></img></a></p>
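<p dir="ltr">For anyone building this by hand rather than taking the fix pack schema, the two fields could be added with ObjectServer SQL along these lines. The fix pack&#39;s own SQL file is authoritative; the column sizes here are assumptions.</p>

```sql
-- Hedged sketch: add the grouping fields to alerts.status by hand.
-- The Fix Pack 4 SQL file is the authoritative version; sizes are guesses.
alter table alerts.status add column SiteName varchar(64);
alter table alerts.status add column ScopeID varchar(64);
go
```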
<p dir="ltr">In practice then, for a cellular network the ScopeID can be set to the BSC name, and in many instances that name can be extracted from the Fully Distinguished Name that is in the alarm itself. A typical 3GPP standards compliant DN might read: &quot;GSM/BSC-43141/BCF-11/BTS-11/TRX-10&quot;. It&#39;s a simple task to extract &quot;BSC-43141&quot; out of there and use that to populate ScopeID.</p>
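<p dir="ltr">In a probe rules file that extraction might be sketched as follows. The $DN token is an assumption standing in for wherever your probe presents the distinguished name; substitute the real token.</p>

```
# Hypothetical sketch: pull the BSC name out of a 3GPP-style DN.
# $DN is an assumed token holding e.g. "GSM/BSC-43141/BCF-11/BTS-11/TRX-10".
if (regmatch($DN, "BSC-[0-9]+"))
{
    @ScopeID = extract($DN, "(BSC-[0-9]+)")
}
```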
<p dir="ltr">In real cellular networks though there might be multiple management domains using different inventories and naming conventions so we have provided a third level of scoping called ScopeAlias so that the same scope called by different names can be linked together, as in this example:</p>
<p dir="ltr" style="text-align: center;"><a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/ScopeAlias.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/ScopeAlias.png" style="width: 600px; display: block; margin: 0px auto; text-align: center;"></img></a></p>
<p dir="ltr">Scope Aliasing is implemented using a custom table in OMNIbus to link ScopeIDs</p>
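<p dir="ltr">For illustration, such a linking table might take a shape like the one below. This is purely a hypothetical sketch: the fix pack&#39;s SQL file defines the real table, so the table name, column names and sizes here are all assumptions.</p>

```sql
-- Hypothetical shape of a scope-alias linking table. The fix pack's SQL
-- file defines the real one; every name and size here is an assumption.
create table master.scope_aliases persistent
(
    ScopeAlias varchar(64) primary key,
    ScopeID    varchar(64)
);
go
```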
<p dir="ltr">These examples are for GSM/3G networks. Other domains require a different approach. I will come back to determining ScopeID and Scope Alias in a later blog.</p>
<p dir="ltr">&nbsp;</p>
<p dir="ltr">The next step is to define &quot;same time&quot;. To do that let&#39;s reflect on how events and alarms are generated and sent. A network card may apparently send a &quot;link down&quot; event instantaneously after a cable is pulled, but in reality what has happened is that the card has detected that certain control signals - which may be as simple as a voltage on a pin - are no longer present. That will take a millisecond or two to detect. At the other end of the link the network card has all its physical level indicators still working, but the card has detected that the logical framing of the carrier signal is no longer present. It may be many seconds before the automatic resynchronisation processes have been tried and failed, and thus many seconds before the alarm is generated. Often though the physical problem is more of a dirty joint than a clean break, and in those circumstances the distant end may detect the problem through increased errors in a background error rate test, a test that takes minutes to run. Or the errors may be detected by a performance management application collecting SNMP metrics every fifteen minutes. &quot;Same time&quot; therefore has to be a time window rather than a fixed time.</p>
<p dir="ltr">The way this is implemented in this Event Grouping automation is to define a quiet period. This is a period after the first alarm when new alarms can be added to the container. If that period is quiet, i.e. no new alarms are added, then the container is closed. Quiet period can be defined by alarm type and if no quiet period is defined then a default is held as a property. This is set to fifteen minutes in the initial installation but most users will want to reduce that.</p>
<p dir="ltr">If a new alarm comes in during the quiet period it can extend the time window if the alarm requires that.</p>
<p dir="ltr">As QuietPeriod can be defined in the rules file, it makes sense to set it according to the type of alarm. It can be fairly long: an alarm reporting that a device has switched to battery power should be prepared to keep the container open for the hour or two it takes to drain the battery, because that is how long it will take for other effects to be noticed. On the other hand, low priority symptom alarms should not extend the quiet period and can be given a QuietPeriod of 1 second - not zero, as that triggers the default to be applied.</p>
<p dir="ltr">The remaining question is which alarm is pointing out the underlying cause of the problem. In a previous blog I wrote about different techniques used historically and the upsides and downsides. What we are doing here is a simplified codebook approach. Rather than score all the possible alarms against each other or create loads of cause and effect relationships we have simply given each alarm a weighting, and as this is an integer determining which is the highest weighted alarm in a group is easy and efficient. And rather than do this for potentially hundreds of alarm types we have defined sixteen generic alarm types and in the rules file we map the vendor alarm codes to these. Our initial normalised alarm class list is as follows:</p>
<p dir="ltr" style="text-align: center;">&nbsp;</p>
<table border="0" cellspacing="0" cols="5" dir="ltr" frame="VOID" rules="NONE">
<colgroup>
<col width="30"></col>
<col width="49"></col>
<col width="264"></col>
<col width="37"></col>
<col width="37"></col>
</colgroup>
<tbody>
<tr>
<td align="LEFT" height="27" style="border-top: 5px solid #000000; border-bottom: 3px solid #000000; border-left: 5px solid #000000; border-right: 3px solid #000000" valign="TOP" width="30">&nbsp;</td>
<td align="CENTER" rowspan="2" style="border-top: 5px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" width="49">Normalised Alarm Code</td>
<td align="LEFT" style="border-top: 5px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP" width="264">&nbsp;</td>
<td align="CENTER" rowspan="2" style="border-top: 5px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" width="37">Cause</td>
<td align="CENTER" rowspan="2" style="border-top: 5px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" width="37">Impact</td>
</tr>
<tr>
<td align="LEFT" height="64" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 5px solid #000000; border-right: 3px solid #000000" valign="TOP">&nbsp;</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Description</td>
</tr>
<tr>
<td align="LEFT" height="133" rowspan="4" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 5px solid #000000; border-right: 3px solid #000000" valign="TOP">Physical</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">160</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Control Shut Down</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">160</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">10</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">150</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Power Loss</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">150</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">20</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">140</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Catastrophic Failure</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">140</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">30</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">130</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">General Failure</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">130</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">40</td>
</tr>
<tr>
<td align="LEFT" height="133" rowspan="4" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 5px solid #000000; border-right: 3px solid #000000" valign="TOP">Sensor</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">120</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Environmental Warning, inc Door Open and similar alarms</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">120</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">50</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">110</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Performance Failure</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">110</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">60</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">100</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Performance Degradation</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">100</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">70</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">90</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Performance Warning</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">90</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">80</td>
</tr>
<tr>
<td align="LEFT" height="133" rowspan="4" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 5px solid #000000; border-right: 3px solid #000000" valign="TOP">Operational</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">80</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Inoperative State, Change of State</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">80</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">120</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">70</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Heartbeat Loss</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">70</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">90</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">60</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Control Path Loss</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">60</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">100</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">50</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Operational Warning, inc running on backup</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">50</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">110</td>
</tr>
<tr>
<td align="LEFT" height="133" rowspan="4" style="border-top: 3px solid #000000; border-left: 5px solid #000000; border-right: 3px solid #000000" valign="TOP">Functional</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">40</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Non-Functional</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">40</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">160</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">30</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Missing Component</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">30</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">150</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">20</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Workarounds in execution</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">20</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 3px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">140</td>
</tr>
<tr>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 5px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">0</td>
<td align="LEFT" style="border-top: 3px solid #000000; border-bottom: 5px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">Informational events</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 5px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">0</td>
<td align="CENTER" style="border-top: 3px solid #000000; border-bottom: 5px solid #000000; border-left: 3px solid #000000; border-right: 3px solid #000000" valign="TOP">130</td>
</tr>
</tbody>
</table>
<p dir="ltr">&nbsp;</p>
<p dir="ltr">The recommended way of implementing the necessary rules file changes is as follows:</p>
<ol dir="ltr">
<li>Create a rules include file mapping vendor alarm types to Normalised Alarm Codes and OSI levels, and setting individual Quiet Period times</li>
<li>Acquire a copy of the genericcorr.common.include file; this file calculates the cause and impact weightings generically</li>
<li>Add lines at the bottom of the existing rules file to include the vendor-specific mapping file and genericcorr.common.include, in that order</li>
<li>Reload the probe rules.</li>
</ol>
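<p dir="ltr">As a sketch, the lines added at the bottom of an existing rules file might look like this (the file names and the $NC_RULES_HOME path are illustrative, not the shipped ones):</p>
<p dir="ltr"><span style="font-family:courier new,courier,monospace;"># map vendor alarm types to Normalised Alarm Codes, OSI levels<br />
# and Quiet Period times, then apply the generic weighting logic<br />
include &quot;$NC_RULES_HOME/include/vendor.mapping.include&quot;<br />
include &quot;$NC_RULES_HOME/include/genericcorr.common.include&quot;</span></p>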
<p dir="ltr">That covers the changes needed to rules files. Next time I&#39;ll cover setting up Event Views and provide examples.</p>
<p dir="ltr">Event Grouping by Alarm Scope - Part Two - the theory (2015-08-25)</p>
<p dir="ltr">It&#39;s a common situation. A pile of alarms and events hit the event management console and the network operator is faced with the questions &quot;what happened here?&quot; and &quot;what do I need to fix?&quot;. That information is probably in those alarms but will the operator have time to dig it out before the next wave strikes? In telcos and major enterprises that is rarely the case. Over the years a number of event management strategies have been developed to direct operators to those alarms that have the answers to those two questions.</p>
<p dir="ltr">The simplest approach was just to filter out the crap: if an event is purely informational then discard it; if it is reporting a failed component then let it through, but hide it from operators unless it is of the highest severity. Back in the mid-noughties I was speaking with a tier 1 telco&#39;s NOC manager who said their policy was that first-line operators dealt with critical alarms first, then started on the major alarms and finally the minor alarms - only the operators never finished dealing with the critical alarms. The biggest weakness of this approach is that alarm severity is defined by a software engineer working for the equipment vendor, and that severity is only relevant in terms of that piece of equipment. A dev manager at a major NEP also told me once that budget was more important than technology when it came to instrumenting equipment, which is another factor to consider. I think we can conclude that simple filtering has had its day.</p>
<p dir="ltr">Another approach has been to try to identify cause and effect, what many people call root cause analysis. These days I consider that any salesperson who utters the acronym RCA in front of a customer has committed a hanging offence, as that term now means so many different things to different people. At the very least we should qualify what sort of analysis we mean when we say the dreaded TLA &quot;RCA&quot;. There are three basic types.</p>
<p dir="ltr">The first is Topology RCA, which is what Tivoli&#39;s Network Manager does. The principle is that if you can discover how a network is put together then you can determine which alarms are related to each other and, from that, which is the probable root cause alarm. Initially Network Manager worked solely with connectivity alarms - unsuccessful ping tests - and the analysis was to find the common point of failure, but the basic principle has been extended to other classes of alarms. This approach does require a topology model though, and these days those models can be very complex, with different layers that need stitching together.</p>
<p dir="ltr">Closely related to Topology RCA is Containment RCA. This is what the AdvCorr automations included in OMNIbus do. I&#39;ve blogged about AdvCorr before (<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/what_s_in_a_name_correlation_by_exploiting_naming_conventions?lang=en">What&#39;s in a name - correlation by exploiting naming conventions</a>) but to summarise here, AdvCorr applies two tests: could the alarm be a root cause or is it always just a symptom, and does the reporting node contain other nodes in alarm, or is it a member of a container headed by another node in alarm as the parent? The containment is usually physical - ports on a card in a rack - but can be logical, such as the GSM network example of my earlier blog. AdvCorr is neat, but alarms do need to hold the information needed to determine containment and that is not always the case.</p>
<p dir="ltr">The third approach is to attempt to tabulate cause and effect. If a cellular base station controller reports a control channel failure to a cell and an uplink failure to the same cell together, we can say that the latter caused the former. This can be done for other pairs or trios of alarms, even for larger groupings, and an entire &quot;codebook&quot; of alarm causation can be assembled. It will however be a very big book given the number of alarm types defined these days: there are over 6000 defined alarms for Alcatel 5620 SAM, for example, and Huawei alarms are also listed in the thousands. Because of this, codebook systems have fallen out of favour.</p>
<p dir="ltr">All these approaches aim for a high degree of certainty, but is that possible in cases where not all devices are fully instrumented, where not all alarms occur in the same managed domains, and where the delivery of some alarms may be impeded by the very alarm condition they are trying to report?</p>
<p dir="ltr">Late last year we were presented with an alarm cause analysis problem by a large Asian telco. The customer was leaning towards a codebook-type solution, but we felt that while that was feasible for the three or four scenarios cited, it was not something that would scale, and that adding more and more scenarios would make the system unwieldy. We therefore decided to go back to first principles.</p>
<p dir="ltr">We started by deciding that our approach would be more one of guidance than direction, and that we would only alter alarms by adding some extra fields that we would use for setting up event relationships in the WebGUI Event Viewer. Adding extra fields also meant we would not risk any automations a customer had already implemented. We also decided that amendments to rules files would be in the form of include files, so that the changes to existing rules files would be limited to one or two include statements at the bottom plus, possibly, a lookup table definition at the top. Customers&#39; modifications to standard rules files would be safe.</p>
<p dir="ltr">We then made the following assumption: alarms that occur at the same time in the same place probably have the same underlying cause. If that was our assumption then we had to do three things: define what same time meant, define what same place meant, and decide which of the group of alarms sharing those attributes is the most likely cause.</p>
<p dir="ltr">Same time could not mean exactly the same time because some alarms are not generated until a condition has existed for a while. An x in y policy might be in place for example, or more commonly an alarm is the result of a regular monitoring period exceeding a threshold for errors. Same time therefore means within the same time window.</p>
<p dir="ltr">Same place could mean within the same node, or within the same site but that would not cover communications alarms, most commonly when both ends of a circuit report a problem on the link connecting them. Therefore we developed the concept of alarm scope. The scope of an alarm extends as far as the impact of that alarm might be felt, for example the scope of that communications alarm would be the two sites connected together and the link between them.</p>
<p dir="ltr">For determining likely cause we defined generic cause types, each with a different weighting; probable cause analysis then consists of picking the highest-weighted cause in the group.</p>
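<p dir="ltr">As an illustration (the table, field and path names here are made up for the sketch, not the shipped implementation), the weighting can be assigned in probe rules with a lookup from a normalised alarm code to a cause weighting, with the grouping logic later selecting the event carrying the highest weight:</p>
<p dir="ltr"><span style="font-family:courier new,courier,monospace;"># illustrative only: weight lookup keyed on a normalised alarm code<br />
table causeweights = &quot;$NC_RULES_HOME/include/causeweights.lookup&quot;<br />
default = {&quot;0&quot;}<br />
<br />
# ... later, after $NormalisedAlarmCode has been set by the mapping rules ...<br />
$weight = lookup($NormalisedAlarmCode, causeweights)<br />
@CauseWeight = int($weight)</span></p>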
<p dir="ltr">That is the theory. In the next blog I will go into more detail on how this has been implemented.</p>
<p dir="ltr">Event Grouping by Alarm Scope - Part One (2015-08-24)</p>
<p dir="ltr">One ticket per incident is a Holy Grail in Network Operations. Being able to correlate a stream of alarms sharing the same underlying cause into a single line entry and using that to drive the problem resolution systems is something telcos and major enterprises have looked for for years.</p>
<p dir="ltr">Consider the economics of event generation and management. If it costs $100 to instrument a router or access point, and that device now sends on average one event every ten minutes, then the cost per event, averaged over a year, is about a fifth of a cent.</p>
<p dir="ltr">If those events go to an event management system costing $200,000 amortised over two years, with $20,000 p.a. staff costs, and handling 15,000 events per day, then each event costs about two cents to handle.</p>
<p dir="ltr">However each event that causes a ticket to be opened starts costing serious money. If we estimate an average of 30 minutes work by a level one tech on $9 an hour before being handed on to a level two tech on $18 an hour who spends two hours on the problem, then we are talking of $40 per incident. That&#39;s acceptable if the ticket cites a real problem; however a ticket that is opened in duplicate, or that requires no action, can still cost $5 to process. A dozen of those a day and we could be talking of $20,000 a year spent on essentially useless work. Or to put it another way, one network technician wasting their time all year.</p>
<p dir="ltr">Clearly, cutting down on unnecessary ticket cutting is a money saver.</p>
<p dir="ltr">It&#39;s also a time saver and that might mean network operators have time to look beyond the red flood of critical events. Event severity is normally set by the equipment vendor and they set severity according to the demands of the hardware. However what is critical to the workings of one box may not be critical to the end user. Not only that but the underlying cause of a critical event may be reported as a lower priority event. Look at these events from a real 3G network (node names are anonymised)</p>
<p dir="ltr"><a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/3G-eventview.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/3G-eventview.png" style="width: 800px; display: block; margin: 0px auto; text-align: center;"></img></a></p>
<p dir="ltr">&nbsp;</p>
<p dir="ltr">Both critical events are indeed critical (for those unfamiliar with cellular networks, &quot;BCCH Missing&quot; means the control channel to a cell is quiet, i.e. the cell is dead), but neither is reporting the underlying cause. The underlying causes are, in each case, the severity 4 alarms listed above them.</p>
<p dir="ltr">Now I&#39;ve blogged about event correlation before but now that a new automation for event grouping has been included with OMNIbus, in Fix Pack 3 for v8.1 and an improved version in FP4 out shortly, this is a good time to revisit the topic. Next time I&#39;ll cover historical techniques and introduce this new event grouping feature</p>
<p dir="ltr">&nbsp;</p>One ticket per incident is a Holy Grail in Network Operations. Being able to correlate a stream of alarms sharing the same underlying cause into a single line entry and using that to drive the problem resolution systems is something telcos and major...00877urn:lsid:ibm.com:blogs:entries-9c12c00f-1498-4573-8fc7-cfd022b8c15aGetting more out of Netcool OMNIbus2016-06-16T08:31:02-04:00urn:lsid:ibm.com:blogs:entry-cd7d86d0-9969-49a4-9bd6-037b9f428389Getting more out of OMNIbus - customising the simnet probe for other industries7YA6_Wim_Harthoorn0600027YA6activefalse7YA6_Wim_Harthoorn0600027YA6activefalseComment Entriesapplication/atom+xml;type=entryLikestrue2014-06-05T03:09:09-04:002014-06-05T03:09:09-04:00<p dir="ltr">
Network status visualisation has come a long way since the days when an event list was the only way to view alarms. With the imminent arrival of new versions of OMNIbus and WebGUI, the quality of visualisation available is about to take another big step forward, and this brings with it a new problem. As operators&#39; displays can now be highly sophisticated visualisations, it becomes more important to demonstrate new offerings to potential users and test whether they meet requirements and are usable.</p>
<p dir="ltr">
One of the first probes most of us come across, at least if we do a formal training course, is the humble simnet probe. This probe may serve no practical purpose in a real world operational set up, but it is a useful tool for testing and demonstrating visualisations. Or at least it would do if it could be configured with more realistic alarms than the rather 1990s datacentre ones it comes with. This blog describes how I configured the simnet probe to simulate smartmeter alarms.</p>
<p dir="ltr">
A little bit of simnet probe 101 here. The probe automatically generates four alarm types; the nodes, and the alarm type the probe will generate for each of them, are defined in a definitions file, simnet.def by default. The probe reads through this definitions file and generates events, which are then manipulated by a rules file in the usual way before being inserted into the Object Server alerts.status table.</p>
<p dir="ltr">
The four alarm types are Link Down/Up, Node offline/online, Diskspace alert and Port failure. However, for our purposes it is important to distinguish how each of these alarm types differs in execution. The table below sets things out.</p>
<table border="1" dir="ltr" height="109" style="width: 633px;" width="630">
<tbody>
<tr>
<td style="width: 42px;">
vtype</td>
<td style="width: 137px;">
Alarm type</td>
<td style="width: 380px;">
Alarm execution</td>
</tr>
<tr>
<td style="width: 42px; text-align: center;">
0</td>
<td style="width: 137px;">
Link Down/Up</td>
<td style="width: 380px;">
An alarm is generated followed some time later by a clear alarm. If the rules file is set up correctly this will demonstrate the generic clear automation as well as put alarms into the system</td>
</tr>
<tr>
<td style="width: 42px; text-align: center;">
1</td>
<td style="width: 137px;">
Node offline/online</td>
<td style="width: 380px;">
An alarm is generated followed shortly by a second alarm. Typically the second alarm is not treated as a clear</td>
</tr>
<tr>
<td style="width: 42px; text-align: center;">
2</td>
<td style="width: 137px;">
Diskspace Alert</td>
<td style="width: 380px;">
An alarm is created with a random integer field of between 75 and 100 which is used to simulate % disk space utilisation</td>
</tr>
<tr>
<td style="width: 42px; text-align: center;">
3</td>
<td style="width: 137px;">
Port Failure</td>
<td style="width: 380px;">
An alarm is generated with a random integer field of between 1 and 8, which is used to simulate port failures on the sort of semi-intelligent switches in use circa 1995.</td>
</tr>
</tbody>
</table>
<p dir="ltr">
The simnet probe generates tokens which are then assigned to Object Server fields by the rules file. This is where we can step in and change those assignments to achieve more realistic alarms for our demonstrations.</p>
<p dir="ltr">
I wanted to create a simulation of a summary console for an electrical supplier whose customers have been equipped with smartmeters. I had the documentation from one of our partners to guide me to typical alarm types. From that I could select suitable alarms and map them to the ones the simnet probe was going to provide.</p>
<table border="1" dir="ltr" height="124" style="width: 747px;" width="764">
<tbody>
<tr>
<td style="width: 43px;">
vtype</td>
<td style="width: 252px;">
Alarm type</td>
<td style="width: 416px;">
Notes</td>
</tr>
<tr>
<td style="width: 43px;">
0</td>
<td style="width: 252px;">
Primary Power Fail/Primary Power restored</td>
<td style="width: 416px;">
An alarm sent when the main electrical power has failed. Typically the fail alarm is a last gasp alarm and may or may not get through, so it is not unusual for there to be restore events with no corresponding fail alarm.</td>
</tr>
<tr>
<td style="width: 43px;">
1</td>
<td style="width: 252px;">
Security alarms, e.g. tamper, inversion</td>
<td style="width: 416px;">
Other alarms that may clear.</td>
</tr>
<tr>
<td style="width: 43px;">
2</td>
<td style="width: 252px;">
Communications delays, power levels</td>
<td style="width: 416px;">
Alarms that require a metric in them, for example round trip delays</td>
</tr>
<tr>
<td style="width: 43px;">
3</td>
<td style="width: 252px;">
System Alarms</td>
<td style="width: 416px;">
<p>
Alarms that might be sent when the smartmeter&#39;s internal software detects a problem. Eight possible alarms were selected:</p>
<ol>
<li>
Clock Error Detected</li>
<li>
Checksum Error</li>
<li>
Config Error Detected</li>
<li>
Fatal Error</li>
<li>
Low Battery Detected</li>
<li>
Temperature Threshold Breached</li>
<li>
Demand Overload Detected</li>
<li>
Measurement Error Detected</li>
</ol>
</td>
</tr>
</tbody>
</table>
<p dir="ltr">
Having set out a plan, the first step was to create a definition file, smartmeter.simnet.def. Since part of the pitch for a Netcool solution in the smartmeter space is its ability to scale, this definition file needs to be biggish - at least 500 lines. I created it in (ahem) Microsoft Excel. Excel has features such as fill, random numbers and sort, which meant I could create a long list of meter ID numbers, assign random vtype numbers and probabilities, and later sort them to add enrichment fields for a lookup file. The spreadsheet is then saved as a text file. Other spreadsheet programs probably have the same features.</p>
<p dir="ltr">
My def file starts something like this:</p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">2438492775 2 79<br />
2438492798 2 82<br />
2438492821 3 28<br />
2438492844 3 21<br />
2438492867 1 65<br />
2438492890 0 92<br />
2438492913 1 12<br />
2438492936 1 77<br />
2438492959 1 100<br />
2438492982 1 50<br />
2438493005 3 67<br />
2438493028 1 98<br />
2438493051 1 5<br />
2438493074 2 77<br />
2438493097 0 3<br />
etc</span></p>
<p dir="ltr">
Fortunately a meter ID can be a ten or eleven digit integer, so it was easy to use the fill instruction to generate 500 quickly.</p>
<p dir="ltr">
The next step was to modify the rules file, which I subsequently saved as smartmeter.simnet.rules. I won&#39;t post the entire file here, but to give an idea I&#39;ll show how I dealt with the vtype 3 events.</p>
<p dir="ltr">
The tokens sent by the event generator were something like this:</p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">$Agent -&gt; MachineLogs<br />
$Group -&gt; Link<br />
$Summary -&gt; Port failure : port reset<br />
$Severity -&gt; 2<br />
$PortNumber -&gt; 4<br />
$EventNumber -&gt; 2491<br />
$Node -&gt; 2438496018<br />
$DateString -&gt; 04/06/2014 16:49:49<br />
@FirstOccurrence -&gt; 1401896989<br />
@LastOccurrence -&gt; 1401896989<br />
$ServiceLevel -&gt; 0</span><br />
<br />
Re-assigning these tokens to the alarm types suggested in the table above was achieved by an if statement followed by a switch statement, using $PortNumber to separate the possibilities:</p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">if (nmatch($Summary, &quot;Port failure&quot;))<br />
&nbsp;&nbsp; &nbsp;{<br />
&nbsp;&nbsp; &nbsp;switch ($PortNumber) {<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;case &quot;1&quot;:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;Clock Error Detected&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_System&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Severity = 3<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertKey = &quot;Controller Module&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;case &quot;2&quot;:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;Checksum Error&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_System&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Severity = 2<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertKey = &quot;Comms Module&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;case &quot;3&quot;:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;Config Error Detected&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_System&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Severity = 2<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertKey = &quot;Controller Module&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;case &quot;4&quot;:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;Fatal Error&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_System&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Severity = 5<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertKey = &quot;Controller Module&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;case &quot;5&quot;:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;Low Battery Detected&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_System&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Severity = 3<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertKey = &quot;Battery&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;case &quot;6&quot;:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;Temperature Threshold Breached&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_System&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Severity = 3<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertKey = &quot;Measurement&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;case &quot;7&quot;:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;Demand Overload Detected&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_Security&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Severity = 4<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertKey = &quot;Measurement&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;case &quot;8&quot;:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;Measurement Error Detected&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_System&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Severity = 4<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertKey = &quot;Measurement&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;default:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;Unknown Error&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_System&quot;<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Severity = 1<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@AlertKey = &quot;unknown&quot;<br />
&nbsp;&nbsp; &nbsp;}<br />
&nbsp;&nbsp; &nbsp;}</span><br />
&nbsp;</p>
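<p dir="ltr">The vtype 0 Link Down/Up events can be remapped along the same lines. As a hedged sketch (the summary strings matched and the field values set are illustrative, not taken from the uploaded files), setting @Type to 1 for the problem event and 2 for the resolution lets the standard generic clear automation pair the fail and restore events:</p>
<p dir="ltr"><span style="font-family:courier new,courier,monospace;">if (nmatch($Summary, &quot;Link Down&quot;))<br />
{<br />
&nbsp;&nbsp; &nbsp;@Summary = &quot;Primary Power Fail&quot;<br />
&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_Power&quot;<br />
&nbsp;&nbsp; &nbsp;@AlertKey = &quot;Mains&quot;<br />
&nbsp;&nbsp; &nbsp;@Severity = 4<br />
&nbsp;&nbsp; &nbsp;@Type = 1<br />
}<br />
if (nmatch($Summary, &quot;Link Up&quot;))<br />
{<br />
&nbsp;&nbsp; &nbsp;@Summary = &quot;Primary Power Restored&quot;<br />
&nbsp;&nbsp; &nbsp;@AlertGroup = &quot;SM_Power&quot;<br />
&nbsp;&nbsp; &nbsp;@AlertKey = &quot;Mains&quot;<br />
&nbsp;&nbsp; &nbsp;@Severity = 1<br />
&nbsp;&nbsp; &nbsp;@Type = 2<br />
}</span></p>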
<p dir="ltr">
A further refinement is to add a lookup table so the alarms can be enriched. In production systems I&#39;d advocate using Netcool Impact for event enrichment but for a simple visualisation demo I prefer the simplicity of a lookup table. Again I used Excel with its sort capabilities. In particular I wanted to make power fail alarms happen in only one or two geographies. I also wanted to assign substation IDs and create customers for meters. The last requires a list of names. The US Congress has just over 500 members and lists of senators and house members are on the internet, so it wasn&#39;t difficult to get those names. The first few lines of the lookup file are these, and coincidentally include a name non-Americans may have heard of:</p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">Meter_ID&nbsp;&nbsp; &nbsp;County&nbsp;&nbsp; &nbsp;Substation_Code&nbsp;&nbsp; &nbsp;Customer&nbsp;&nbsp; &nbsp;Phone<br />
2438492867&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;Miller,G.&quot;&nbsp;&nbsp; &nbsp;909-215-1204<br />
2438492890&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;Waxman,H.&quot;&nbsp;&nbsp; &nbsp;909-215-1767<br />
2438493327&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;Pelosi,N.&quot;&nbsp;&nbsp; &nbsp;909-215-1274<br />
2438493488&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;Rohrabacher,D.&quot;&nbsp;&nbsp; &nbsp;909-215-1110<br />
2438493695&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;Waters,M.&quot;&nbsp;&nbsp; &nbsp;909-215-1734<br />
2438493833&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;Becerra,X.&quot;&nbsp;&nbsp; &nbsp;909-215-1236<br />
2438493902&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;Calvert,K.&quot;&nbsp;&nbsp; &nbsp;909-215-1567<br />
2438493948&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;Eshoo,A.&quot;&nbsp;&nbsp; &nbsp;909-215-1196<br />
2438494155&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;McKeon,B.&quot;&nbsp;&nbsp; &nbsp;909-215-1873<br />
2438494201&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;Roybal-Allard,L.&quot;&nbsp;&nbsp; &nbsp;909-215-1982<br />
2438494224&nbsp;&nbsp; &nbsp;Hamestead&nbsp;&nbsp; &nbsp;AC-065&nbsp;&nbsp; &nbsp;&quot;Royce,E.&quot;&nbsp;&nbsp; &nbsp;909-215-1299</span><br />
&nbsp;</p>
<p dir="ltr">
I made up some county names by following the history of some New York boroughs. Readers may or may not know that Harlem and Brooklyn take their names from the Dutch towns of Haarlem and Breukelen, a throwback to the days when New York was New Amsterdam. So I found more towns and villages from the same district in the Netherlands and anglicised the spellings. The curious might like to work back, but if it&#39;s a quiz there is no prize.</p>
<p dir="ltr">
This lookup table is then used in the usual way.</p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">table smdetails=&quot;$NCHOME/omnibus/probes/linux2x86/smartmeter.simnet.lookup&quot;<br />
default = {&quot;unknown&quot;, &quot;unknown&quot;, &quot;unknown&quot;, &quot;unknown&quot; }</span></p>
<p dir="ltr">
as the first two lines of the rules file and then near the end:</p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">&nbsp;&nbsp;&nbsp; [@Location, @Service, @Customer, $phone] = lookup (@Node,smdetails)<br />
&nbsp;&nbsp; &nbsp;@Customer = @Customer + &quot;:&quot; + $phone</span><br />
&nbsp;</p>
<p dir="ltr">
The final piece of work is to create a props file that tells the probe which def and rules files to use; while we&#39;re about it, we can also slow down the operation of the probe so there is time to talk about things when giving the demo.</p>
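<p dir="ltr">A minimal props file might look like the sketch below. RulesFile is a standard probe property; the other two property names are placeholders only - check the simnet probe reference for the actual properties controlling the definitions file and the event rate:</p>
<p dir="ltr"><span style="font-family:courier new,courier,monospace;"># smartmeter.simnet.props - names other than RulesFile are placeholders<br />
RulesFile : &#39;/opt/netcool/omnibus/probes/linux2x86/smartmeter.simnet.rules&#39;<br />
DefinitionFile : &#39;/opt/netcool/omnibus/probes/linux2x86/smartmeter.simnet.def&#39;<br />
EventInterval : 30</span></p>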
<p dir="ltr">
And the result? My first effort was a WebGUI 7.4.1 offering:</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/webGUI74_sm.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/webGUI74_sm.png" style=" width:100%; display:block; margin: 0 auto;text-align: center;" /></a></p>
<p dir="ltr">
Wires are set up so that clicking on an icon on the map or in the county summary changes the event list window so that only the events relevant to that icon are shown.</p>
<p dir="ltr">
I have uploaded the rules and other files to this blog site; they are available <a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_SAMPLE_FILES/smartmeter_simnet.zip">here</a>.</p>
<p dir="ltr">
Network status visualisation has come a long way since the days when an event list was the only way to view alarms. With the imminent arrival of new versions of OMNIbus and WebGUI the quality of visualisation available is about to take another big step forward.</p>
<h3 dir="ltr">
Getting more out of OMNIbus - using dynamic properties for stateful probe rules files (10 April 2014)</h3>
<p dir="ltr">
Recently, in my blog on using Node-RED with OMNIbus, I made a statement that the advantage of Node-RED was that the state of a variable could be carried over to subsequent runs through the rules script, as this was not available in probe rules. I was told, by no less an authority than Kristian Stewart, that this was incorrect. Probe rules can create and modify probe properties on the fly and these can be used to carry variables through. I thought I must give this a go, as to me this had been a well-kept secret.</p>
<p dir="ltr">
I was of course well aware that in probe rules @Variable indicates a field in the target Object Server and $Variable indicates either a token passed to the rules file or a variable created by the rules file. What was new to me was that there is a third prefix: %Variable indicates a probe property. A typical use of this capability is to change a property dynamically. For example,</p>
<p dir="ltr">
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>%RawCapture = 1</em></p>
<p dir="ltr">
will turn on raw capture for the event being processed (and for all subsequent events, unless a matching <em>%RawCapture = 0</em> is placed earlier in the rules file to turn raw capture off at the start of each run of event processing).</p>
<p dir="ltr">
In this case the property (RawCapture) is one defined in the props file. However if the property is not defined in the props file then it will be created as a transient property. Transient, because it will be destroyed if the probe is restarted.</p>
<p dir="ltr">
Now this gave me an idea. Probes such as the ping probe do more than give a binary result of good or bad; they can also give an intermediate poor result. To be specific, the ping probe reports a node as active, slow or not reachable. When the probe reports a slow response it would be useful if, on de-duplication, the event indicated whether things are getting worse or better, so that we can have an event display like this:</p>
<p dir="ltr">
<br />
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/pingprobe.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/pingprobe.png" style=" width:100%; display:block; margin: 0 auto;text-align: center;" /></a></p>
<p dir="ltr">
As with all probe rules file work it is important to understand what data we have to work with. With the ping probe, as with others, this is available in the probe documentation, in the section on elements. It is also useful when developing rules files to turn on details with this line near the end of the rules file:</p>
<p dir="ltr">
<em>details ($*)</em></p>
<p dir="ltr">
This gives us examples of the tokens received by the probe as well as any $ variables created in the rules.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/pingprobe1.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/pingprobe1.png" style=" display:block; margin: 0 auto;text-align: center;" /></a></p>
<p dir="ltr">
From this we can see that the ping probe returns some useful information. Or at least it does if the $status is &quot;slow&quot;; for some reason this is not returned for the &quot;alive&quot; status. The first thing then is to extract the trip time from the icmp stats and put it into a field in the Object Server.</p>
<p dir="ltr">
The second thing is to set up the probe so that the slow status threshold is set a bit lower than the default. The defaults were, after all, defined in the days when 64kbps was thought to be quite a fast line speed and 2Mbps was a typical core network line speed. In today&#39;s world of Gigabit Ethernet, triple-digit millisecond delays can indicate problems with buffers or congested paths. The trip and trigger time properties can be set globally in the props file but I prefer to set them individually for each host in the ping file:</p>
<p dir="ltr" style="margin-left: 40px;">
<em>www.google.com&nbsp; 30&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 50 &nbsp;&nbsp;&nbsp;&nbsp; 1500<br />
www.bbc.com&nbsp;&nbsp;&nbsp;&nbsp; 30&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 50&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1500<br />
www.ibm.com&nbsp;&nbsp;&nbsp;&nbsp; 30&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 50&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1500<br />
localGateway&nbsp;&nbsp;&nbsp; 15&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 25&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1500</em></p>
<p dir="ltr">
To hold the round trip time in the event I created a field called PingDelay, added it to the display view as well, and put a couple of lines in the rules file to populate it. (Details do not update on de-duplication, so the icmp_stats display shown above will be that at event insertion and will stay the same thereafter.)</p>
<p dir="ltr">
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( exists ($icmp_stats)) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @PingDelay = extract ($icmp_stats, &quot;time=([0-9]+) ms&quot; )</em><br />
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</em><br />
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; update (@PingDelay)</em></p>
<p dir="ltr">
The update line is to make sure that de-duplicated events carry the latest information. The OMNIbus field could be given the property of update on de-duplication, but I regard an update command in the rules file as the more reliable option.</p>
<p dir="ltr">
We now want to persist the value of @PingDelay so that the next time the probe rules run we can test whether the result is worse or better than the previous one. So if we want to track the performance of pings to www.ibm.com we can hold the previous value in a dynamic property and test against that:</p>
<p dir="ltr">
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( int(@PingDelay) &gt; int(%ibm_ping) ) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Severity = int(@Severity) + 1<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Summary = @Summary + &quot;: Performance worsening : last value = &quot; + @PingDelay + &quot;ms : previous value = &quot; + %ibm_ping + &quot;ms&quot;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Severity = int(@Severity) - 1<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Summary = @Summary + &quot;: Performance improving : last value = &quot; + @PingDelay + &quot;ms : previous value = &quot; + %ibm_ping + &quot;ms&quot;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; %ibm_ping = @PingDelay</em><br />
<br />
This will work if the ping probe is only pinging a single host. If the ping file contains multiple hosts then this section of code needs wrapping inside a switch statement, with a separate case for each host, for example:</p>
<p dir="ltr">
<em>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp; switch (@Node) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; case &quot;www.bbc.com&quot; :<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( int(@PingDelay) &gt; int(%bbc_ping) ) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ..... etc</em><br />
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; case &quot;www.google.com&quot; :<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( int(@PingDelay) &gt; int(%google_ping) ) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ..... etc</em><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>case &quot;www.ibm.com&quot; :<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( int(@PingDelay) &gt; int(%ibm_ping) ) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ..... etc<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; default:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Summary = @Summary + &quot;:&quot; + $icmp_stats<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</em><br />
&nbsp;</p>
<p dir="ltr">
What would be nice would be the ability to generate the property names dynamically as well, but I don&#39;t think that is possible. I stand to be corrected, though.</p>
<p dir="ltr">
&nbsp;</p>
<h3 dir="ltr">
Getting more out of OMNIbus - finishing off spike and step alarms (28 February 2014)</h3>
<p dir="ltr">
Last time in this blog I looked at creating alarms when data values collected every five minutes suddenly changed significantly, what I called <a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_spike_and_step_alarms?lang=en">spike&nbsp;alarms</a>. I only looked at creating the alarms, and I didn&#39;t look further into what the event life cycle would be. This is important because you create problems for OMNIbus if there is no means of clearing alarms when they are no longer relevant. The plan I propose is:</p>
<ul dir="ltr">
<li>
Create an alarm when a 5% difference between successive measures is detected.<br />
&nbsp;</li>
<li>
Clear the spike alarm if measures return to baseline within ten minutes<br />
&nbsp;</li>
<li>
Reduce the severity and change the alarm to a step alarm if measures remain at new levels for thirty minutes<br />
&nbsp;</li>
<li>
Reduce the severity and change to a trend alarm if measures settle between spike level and baseline.</li>
</ul>
<p dir="ltr">
Step alarms may merely be reporting a planned change, and spike alarms are unlikely to be reacted to immediately, so what is the value of creating them? Well, not a lot if the events come as single spies, but if they come in battalions (I&#39;m referencing Hamlet here, if you hadn&#39;t noticed) then that is useful to know. In OMNIbus this will be indicated by an alarm with a high Tally count, but more usefully, if OMNIbus is fronting a service management tool such as Maximo then these spike alarms can be sent into a modelling package which determines whether maintenance procedures should be reviewed.</p>
<p dir="ltr">
Drawing up a flow chart might be helpful.</p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/spike_flow-2a.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/spike_flow-2a.png" style="width: 600px; display: block; margin: 0px auto; text-align: center;" /></a></p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr">
To recap on earlier blogs, I am using a piece of software from IBM Hursley called <a href="http://nodered.org/">Node-RED</a> to provide more sophisticated processing of data than a probe rules file can manage. The most important extra feature of Node-RED is that its flows can remember the results of previous runs using its context{} object, unlike a probe rules file which can only run with what it is given. (<span style="color:#8b4513;"><em>Or as I thought, but it appears that probes can be made stateful by creating a property on the fly and assigning it a value - something for a later blog I think</em></span>) There are three things that need to be remembered between individual runs, namely:</p>
<p dir="ltr" style="margin-left: 40px;">
context.store - which is set to the measured value at the end of the run so that it is remembered for the next time</p>
<p dir="ltr" style="margin-left: 40px;">
context.sbase - which is set to the store value when the first alarm is created so subsequent runs can compare metrics against the alarm baseline</p>
<p dir="ltr" style="margin-left: 40px;">
context.salarm - which is used to mark whether an alarm state has been set and also whether the clear times have expired</p>
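<p dir="ltr">
The mechanism these variables rely on can be sketched in a few lines. The snippet below is a simplified, stand-alone illustration of how a function node carries state between runs; the variable names are mine, and the context object is simulated so the behaviour can be seen outside Node-RED:</p>

```javascript
// Minimal sketch of Node-RED statefulness: the context object persists
// between invocations of a function node, while msg is built afresh each
// time. Here context is simulated so the snippet runs stand-alone.
var context = {};  // in a real function node this is provided by Node-RED

function run(msg) {
    if (context.store === undefined) {
        context.store = null;  // first ever run: nothing to compare against
    }
    var current = msg.payload.demand;
    var out = null;
    if (context.store !== null && current > context.store) {
        out = "demand rising: " + context.store + " -> " + current;
    }
    context.store = current;  // remember this reading for the next run
    return out;
}

console.log(run({ payload: { demand: 49047 } }));  // first run: no history, null
console.log(run({ payload: { demand: 49500 } }));  // higher than last time
```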
<p dir="ltr">
As before, it is important to have some understanding of the monitored environment. As we have been storing all measures in a log file it is a simple job to get a day&#39;s worth of data and put it into Excel to create a chart.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/transfers-chart.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/transfers-chart.png" style="width: 700px; display: block; margin: 0px auto; text-align: center;" /></a></p>
<p dir="ltr">
From this chart we can see that only the international transfers have the step pattern these alarms are looking for; the national transfers (North-South, Scotland-England) have a different sort of variation. Each transfer needs to be analysed individually and that is done by creating array variables and using a for loop to step through each one. This was covered in my <a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_spike_and_step_alarms?lang=en">previous&nbsp;blog</a>. Using arrays means that limiting the analysis to the three international transfers is easily done simply by restricting the for loop scope to cover values 3 to 5 instead of 3 to 7, like so:</p>
<p dir="ltr" style="margin-left: 40px;">
<em>for (i=3; i &lt; 6; i++) {</em></p>
<p dir="ltr">
In summary then, over the last few blogs I have covered how to obtain three sorts of alarms out of a single small web page of data. Obviously monitoring the UK National Grid in this way is not a practical proposition, but I hope I have demonstrated the principles and that these techniques can be translated to more real-world applications. I should also say that while I have used Node-RED with OMNIbus to create this monitoring package, IBM has other tools as well. The new smart analytics for Cloud can certainly do similar things, and at a much higher scale, but that may be overkill for many customers. ITM agents can also perform some of the things Node-RED has been used for, but again, ITM may not be suitable for all customers.</p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/spike_flow-2d.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/spike_flow-2d.png" style=" width:100%; display:block; margin: 0 auto;text-align: center;" /></a></p>
<p dir="ltr">
So far I have only displayed the alarms generated from metrics in simple event lists. Obviously a WebGUI display with maps and charts as well as event lists would provide a much better overview.</p>
<p dir="ltr">
&nbsp;</p>
<h3 dir="ltr">
Node-RED function node javascript</h3>
<p dir="ltr">
The script below is what I used in this demo and is included for information only. I&#39;m sure it could be improved upon.</p>
<pre dir="ltr">
// This function checks for big changes between successive collections

var transfers = [];
var names = [];
names[0] = "timestamp";
names[1] = "demand";
names[2] = "frequency";
names[3] = "NI_to_GB";
names[4] = "France_to_GB";
names[5] = "Netherlands_to_GB";
names[6] = "North_to_South";
names[7] = "Scotland_to_England";

// Creates context variables if they don't already exist
context.store = context.store || new Array();
context.salarm = context.salarm || new Array();
context.sbase = context.sbase || new Array();
// msg2 is for debugging purposes
// msg2 = {};

// load up with input data
transfers[0] = msg.payload.timestamp;
transfers[1] = msg.payload.demand;
transfers[2] = msg.payload.frequency;
transfers[3] = msg.payload.NIGB;
transfers[4] = msg.payload.FGB;
transfers[5] = msg.payload.NLGB;
transfers[6] = msg.payload.NthSth;
transfers[7] = msg.payload.ScotEng;
msg = {};

// initialise for first use
if (typeof context.scounter == 'undefined') {
    context.scounter = 1;
    for (i = 0; i &lt; 8; i++) {
        context.store[i] = transfers[i];
        context.salarm[i] = 0;
        context.sbase[i] = transfers[i];
    }
    // msg2.payload = msg2.payload + "-" + context.store[0] + "-" + context.store[3];
    // return msg2;
}
else {
    localcounter = 1;
    for (i = 3; i &lt; 8; i++) {

        // first test is to see whether the metric is already in alarm;
        // if not, test for a 5% step
        if (context.salarm[i] == 0) {
            if (transfers[i] &gt; context.store[i] * 1.05 || transfers[i] &lt; context.store[i] * 0.95) {
                FreeformText = "Alert: " + names[i] + " transfer has changed by more than 5%";
                MessageType = "Spike Alarm";
                ProblemType = 5;
                // MonitoredValue = names[i];
                Severity = 4;
                context.salarm[i] = 3;
                context.sbase[i] = context.store[i];
                if (localcounter == 1) {
                    msg.payload = "National Grid|" + transfers[i] + "|" + context.sbase[i] + "|" + transfers[0] + "|" + MessageType + "|" + names[i] + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
                    localcounter++;
                }
                else {
                    msg.payload = msg.payload + "National Grid|" + transfers[i] + "|" + context.sbase[i] + "|" + transfers[0] + "|" + MessageType + "|" + names[i] + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
                }
            }
            else {
                context.store[i] = transfers[i];
            }
        }
        // check for return to baseline
        else if (transfers[i] &gt; context.sbase[i] * 0.99 &amp;&amp; transfers[i] &lt; context.sbase[i] * 1.01) {
            FreeformText = "End of Alert: " + names[i] + " transfer spike of more than 5% has returned to base value";
            MessageType = "Spike Alarm";
            ProblemType = 6;
            // MonitoredValue = names[i];
            Severity = 1;
            if (localcounter == 1) {
                msg.payload = "National Grid|" + transfers[i] + "|" + context.sbase[i] + "|" + transfers[0] + "|" + MessageType + "|" + names[i] + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
                localcounter++;
            }
            else {
                msg.payload = msg.payload + "National Grid|" + transfers[i] + "|" + context.sbase[i] + "|" + transfers[0] + "|" + MessageType + "|" + names[i] + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
            }
            // clean up if that was the case
            context.salarm[i] = 0;
            context.sbase[i] = transfers[i];
        }
        // check for stability at new level
        else if (transfers[i] &gt; context.store[i] * 0.99 &amp;&amp; transfers[i] &lt; context.store[i] * 1.01) {
            context.salarm[i]++;
            if (context.salarm[i] &gt; 8) {
                FreeformText = "End of Alert: " + names[i] + " transfer spike of more than 5% has stayed at new value";
                MessageType = "Spike Alarm";
                ProblemType = 6;
                // MonitoredValue = names[i];
                Severity = 1;
                if (localcounter == 1) {
                    msg.payload = "National Grid|" + transfers[i] + "|" + transfers[2] + "|" + transfers[0] + "|" + MessageType + "|" + names[i] + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
                    localcounter++;
                }
                else {
                    msg.payload = msg.payload + "National Grid|" + transfers[i] + "|" + transfers[2] + "|" + transfers[0] + "|" + MessageType + "|" + names[i] + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
                }
                context.salarm[i] = 0;
                context.sbase[i] = transfers[i];
            }
        }
        // convert to trend alarm if settling between spike and base
        else if (transfers[i] &gt; context.sbase[i] &amp;&amp; transfers[i] &lt; context.store[i]) {
            context.salarm[i]++;
            if (context.salarm[i] &gt; 7) {
                FreeformText = "Information: " + names[i] + " transfer spike of more than 5% has reduced";
                MessageType = "Trend Alarm";
                ProblemType = 5;
                // MonitoredValue = names[i];
                Severity = 2;
                if (localcounter == 1) {
                    msg.payload = "National Grid|" + transfers[i] + "|" + transfers[2] + "|" + transfers[0] + "|" + MessageType + "|" + names[i] + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
                    localcounter++;
                }
                else {
                    msg.payload = msg.payload + "National Grid|" + transfers[i] + "|" + transfers[2] + "|" + transfers[0] + "|" + MessageType + "|" + names[i] + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
                }
                context.salarm[i] = 0;
                context.sbase[i] = transfers[i];
            }
        }
        // check for further deviation
        else if (transfers[i] &gt; context.sbase[i] * 1.08 || transfers[i] &lt; context.sbase[i] * 0.92) {
            FreeformText = "Alert: " + names[i] + " transfer has changed by more than 8%";
            MessageType = "Spike Alarm";
            ProblemType = 5;
            // MonitoredValue = names[i];
            Severity = 5;
            context.salarm[i] = 3;
            // reset sbase when this alarm is triggered
            context.sbase[i] = transfers[i];
            if (localcounter == 1) {
                msg.payload = "National Grid|" + transfers[i] + "|" + transfers[2] + "|" + transfers[0] + "|" + MessageType + "|" + names[i] + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
                localcounter++;
            }
            else {
                msg.payload = msg.payload + "National Grid|" + transfers[i] + "|" + transfers[2] + "|" + transfers[0] + "|" + MessageType + "|" + names[i] + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
            }
        }
        else {
            context.salarm[i]--;
        }

        // wrap up and prepare for next run
        context.store[i] = transfers[i];
        // msg2.payload = msg2.payload + ":" + names[i] + "|" + context.salarm[i] + "|" + context.store[i] + "|" + context.sbase[i];
    }
    return msg;
    // return [msg,msg2];
}
return null;
</pre>
<h3 dir="ltr">
Getting more out of OMNIbus - spike and step alarms (21 February 2014)</h3>
<p dir="ltr">
Over the last few weeks I have been blogging an exercise in creating meaningful alarms from regular usage statistics (<a href="https://www.ibm.com/developerworks/mydeveloperworks/blogs/More_from_OMNIbus/?lang=en">Getting&nbsp;more&nbsp;from&nbsp;Netcool&nbsp;OMNIbus)</a>. In this exercise I am using some statistics provided by the UK&#39;s National Grid and manipulating them using a tool developed by IBM Hursley (<a href="http://nodered.org/">Node-RED</a>) before feeding the alarms to OMNIbus via a TCP socket probe. Node-RED uses javascript to carry out far more tests and comparisons than are possible with a probe rules file.</p>
<p dir="ltr">
As a reminder, the page of data that is fetched from the&nbsp;<a href="http://www.nationalgrid.com/ngrealtime/realtime/systemdata.aspx">National&nbsp;Grid&nbsp;web&nbsp;site</a> provides this information:</p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">Demand: 49047MW<br />
17:15:00 GMT<br />
Frequency: 50.048Hz<br />
17:17:45 GMT</span><br />
<br />
<span style="font-family:courier new,courier,monospace;">System Transfers<br />
<br />
N.Ireland to Great Britain: -252MW<br />
France to Great Britain: 1992MW<br />
Netherlands to GB: 778MW<br />
14/02/2014 17:00:00 GMT<br />
<br />
North-South: 10037MW<br />
Scot - Eng: 2417MW<br />
14/02/2014 17:20:00 GMT</span></p>
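<p dir="ltr">
Before any analysis can happen, that text has to be turned into numbers. The snippet below is a minimal illustration of the kind of extraction involved; the regular expressions and the grab helper are mine, not the actual parse function used in the flow:</p>

```javascript
// Sketch: pull named numeric values out of the National Grid text with
// regular expressions. The page variable holds a fragment of the text above.
var page = "Demand: 49047MW\n17:15:00 GMT\nFrequency: 50.048Hz\n" +
           "France to Great Britain: 1992MW\nNetherlands to GB: 778MW";

function grab(re) {
    var m = page.match(re);          // null if the pattern is not found
    return m ? parseFloat(m[1]) : null;
}

var payload = {
    demand:    grab(/Demand:\s*(-?[\d.]+)MW/),
    frequency: grab(/Frequency:\s*(-?[\d.]+)Hz/),
    FGB:       grab(/France to Great Britain:\s*(-?[\d.]+)MW/),
    NLGB:      grab(/Netherlands to GB:\s*(-?[\d.]+)MW/)
};
console.log(payload);
```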
<p dir="ltr">
So far I have written about how to use Node-RED to detect when the frequency of the grid drops below 50Hz - and to clear the alarm when three consecutive monitoring periods report the frequency is above 50Hz. And a second exercise was to generate an alarm when the trend in demand was steadily upwards, and again provide a clear when demand returned to the previous baseline level.</p>
<p dir="ltr">
The final alarm type to generate is the spike or step alarm. This is where the measured values suddenly jump. In my terminology the spike is when the monitored value jumps and then returns to the baseline; the step is when it jumps and stays at or around the new level.</p>
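<p dir="ltr">
The distinction can be expressed as a small classification, assuming a 5% band for detecting a jump and a 1% band for "back at baseline" (thresholds along those lines are used in this series, but the function and its name here are illustrative):</p>

```javascript
// Sketch: decide whether a new reading is a sudden jump, a return to the
// remembered baseline, or a hold at a new level (a step rather than a spike).
function classify(current, previous, baseline) {
    if (current > previous * 1.05 || current < previous * 0.95) {
        return "jump";       // sudden change: raise a spike alarm
    }
    if (current > baseline * 0.99 && current < baseline * 1.01) {
        return "baseline";   // back to normal, so the jump was a spike
    }
    return "step";           // holding away from baseline: a step change
}

console.log(classify(2100, 1992, 1992));  // "jump"
console.log(classify(1995, 2100, 1992));  // "baseline"
console.log(classify(2095, 2100, 1992));  // "step"
```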
<p dir="ltr">
This time we will monitor the system transfers, but first it will be necessary to redesign the Node-RED flow so that the parsing of the National Grid page and the various alarm analyses are in separate function nodes.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/Node-RED_flow_2.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/Node-RED_flow_2.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr">
The parsing of the HTML is now carried out in a dedicated function node and the output forwarded to all three analysis function nodes. The msg.payload output from this parse function is structured into a stream of name-value pairs which can then be used in the analysis functions.</p>
<pre dir="ltr">
msg.payload = {};
msg.payload.demand = demand;
msg.payload.frequency = frequency;
msg.payload.timestamp = timestamp;
msg.payload.NIGB = NIGB;
msg.payload.FGB = FGB;
msg.payload.NLGB = NLGB;
msg.payload.NthSth = NthSth;
msg.payload.ScotEng = ScotEng;</pre>
<p dir="ltr">
The output is:</p>
<p dir="ltr">
<em>(Object) { &quot;demand&quot;: 49047, &quot;frequency&quot;: 50.048, &quot;timestamp&quot;: &quot;17:15:00 GMT&quot;, &quot;NIGB&quot;: 252, &quot;FGB&quot;: 1992, &quot;NLGB&quot;: 778, &quot;NthSth&quot;: 10037, &quot;ScotEng&quot;: 2417 }</em></p>
<p dir="ltr">
The receiving function can then treat each msg.payload.xxxx element as a variable.</p>
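<p dir="ltr">
For example (a sketch constructed by hand from the output values shown above; in Node-RED the msg object is the function node&#39;s incoming message):</p>

```javascript
// Sketch of how a receiving analysis node sees the parse node's output.
// The msg object here is built by hand from the example values above.
var msg = { payload: { demand: 49047, frequency: 50.048, timestamp: "17:15:00 GMT" } };

var demand = msg.payload.demand;        // each payload element reads like a variable
var frequency = msg.payload.frequency;
```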
<p dir="ltr">
It is useful to put those variables into an array, as in the spike analysis function node.</p>
<pre dir="ltr">
var transfers = [];
transfers[0] = msg.payload.timestamp;
transfers[1] = msg.payload.demand;
transfers[2] = msg.payload.frequency;
transfers[3] = msg.payload.NIGB;
transfers[4] = msg.payload.FGB;
transfers[5] = msg.payload.NLGB;
transfers[6] = msg.payload.NthSth;
transfers[7] = msg.payload.ScotEng;</pre>
<p dir="ltr">
As the variables are now in an array the actual analysis can be performed inside a for loop, which makes things easier should a new monitored variable need to be included.</p>
<p dir="ltr">
The initial analysis is quite straightforward, merely checking whether the new value varies from the previous one by more than 5%:</p>
<pre dir="ltr">
for (var i = 3; i &lt; 8; i++) {
    if ( transfers[i] &gt; store[i] * 1.05 || transfers[i] &lt; store[i] * 0.95 ) {
        FreeformText = "Alert: " + names[i] + " transfer has changed by more than 5%" ;
        MessageType = "Spike Alarm" ;
        ProblemType = 5;
        MonitoredValue = names[i];
        Severity = 4;
        msg.payload = "National Grid|" + transfers[i] + "|" + transfers[2] + "|" + transfers[0] + "|" + MessageType + "|" + MonitoredValue + "|" + ProblemType + "|" + Severity + "|" + FreeformText + "\n\n";
    }
}</pre>
<p dir="ltr">
The msg.payload is sent out to the socket probe as before and results in alarms appearing in OMNIbus.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm6.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm6.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
As in previous blogs, however, it&#39;s not enough to create an alarm; it&#39;s also necessary to define what the life of an alarm is - how it is going to be cleared and what processes need to be applied to it. With the simple <a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_creating_a_threshold_event_with_hysteresis?lang=en">threshold&nbsp;alarm</a> it was an obvious step to create a clear alarm when metrics recrossed the threshold in the homeward direction and let the generic clear take care of things in the normal way. With the <a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_creating_a_trend_event?lang=en">trend&nbsp;event</a> the clear alarm was created when the trend went the opposite way, but a modified clear automation was needed to identify when the trend had returned to the point where the initial alarm was raised. These spike alarms have a different life, and a plan needs to be defined for them. The plan I propose is:</p>
<ul dir="ltr">
<li>
Create an alarm when a 5% difference between successive measures is detected.<br />
&nbsp;</li>
<li>
Clear the spike alarm if measures return to baseline within ten minutes<br />
&nbsp;</li>
<li>
Reduce the severity and change alarm to a step alarm if measures remain at new levels for thirty minutes<br />
&nbsp;</li>
<li>
Reduce the severity and change to a trend alarm if measures settle between spike level and baseline.</li>
</ul>
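<p dir="ltr">
As a rough sketch of the plan above (not the implementation, which is the topic of the next blog, and with function name, parameters and context handling invented for illustration), the lifecycle decision could look like this:</p>

```javascript
// Hypothetical classifier for the spike-alarm lifecycle described above.
// baseline: the level before the spike; spiked: the level at the spike;
// latest: the most recent measure; minutesSinceSpike: elapsed time.
// The 5% band and the 10/30 minute limits come from the plan in the text.
function classifySpike(baseline, spiked, latest, minutesSinceSpike) {
    var nearBaseline = Math.abs(latest - baseline) <= baseline * 0.05;
    var nearSpike = Math.abs(latest - spiked) <= spiked * 0.05;
    if (nearBaseline && minutesSinceSpike <= 10) {
        return "clear";       // returned to baseline within ten minutes
    }
    if (minutesSinceSpike >= 30) {
        if (nearSpike) {
            return "step";    // held the new level: reduce severity, step alarm
        }
        if (!nearBaseline) {
            return "trend";   // settled between spike and baseline: trend alarm
        }
    }
    return "spike";           // otherwise the spike alarm stays open
}
```

The actual alarms would of course be raised, downgraded and cleared through OMNIbus automations rather than in the function node alone.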
<p dir="ltr">
Step alarms may merely be reporting a planned change, and spike alarms are unlikely to be reacted to in isolation. The value of picking them up lies in providing input to the service models that a service tool such as Maximo provides.</p>
<p dir="ltr">
Creating these alarms is the topic for next time.</p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr"><strong>Getting more out of OMNIbus - improving a trend event</strong> (Wim Harthoorn, 14 February 2014)</p>
<p dir="ltr">
In my last blog (<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_creating_a_trend_event?lang=en">&nbsp;Getting&nbsp;more&nbsp;out&nbsp;of&nbsp;OMNIbus&nbsp;-&nbsp;creating&nbsp;a&nbsp;trend&nbsp;event</a> ) I described a basic mechanism for producing a trend alarm and then clearing it when conditions returned to a lower level. There are a few tweaks that can be done to improve this. Specifically I added three extra checks:</p>
<ul dir="ltr">
<li>
A threshold so that we don&#39;t clutter up the event list window with alarms for demand levels well below any level that might cause concern.<br />
&nbsp;</li>
<li>
An early bailout that is triggered by the second measure dropping back below the baseline figure<br />
&nbsp;</li>
<li>
An early bailout that occurs when it becomes clear that five out of six increases will not be achieved</li>
</ul>
<p dir="ltr">
The early bailouts are put in so that we don&#39;t risk a trend being missed because it crosses the boundary between two sets of six measures.</p>
<p dir="ltr">
The threshold is easy to apply with a simple if statement:</p>
<pre dir="ltr">
if (demand &gt; 40000 || context.counter &gt; 1 ) {</pre>
<p dir="ltr">
The comparison with context.counter is there to ensure we don&#39;t inadvertently bypass the initialisation step. The curly brackets {} need to enclose all the comparison steps, and the logic can be made safer by following them with an &quot;else&quot; section.</p>
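<p dir="ltr">
A minimal sketch of that shape (my own illustration, with a deliberately simplified body in place of the real comparison steps):</p>

```javascript
// Sketch: the threshold guard with an "else" section so that sub-threshold
// samples still do something sensible instead of being silently dropped.
function analyse(demand, context) {
    if (demand > 40000 || context.counter > 1) {
        context.counter = (context.counter || 0) + 1;  // the comparison steps go here
        return "analysing";
    } else {
        context.baseline = demand;  // below threshold: just refresh the baseline
        return "below threshold";
    }
}
```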
<p dir="ltr">
As the early bailouts can be triggered at a number of points, it makes for neater programming to use JavaScript&#39;s function capability. The bailout function is defined once and can then be called at any point in the script. Its primary job is to re-initialise the counters and the context{} fields. The measurement that triggers the bailout is in fact used as the baseline for the next thirty minutes of monitoring. The code is:</p>
<pre dir="ltr">
function bailout() {
    context.counter = 1;
    context.counterplus = 0;
    context.counterminus = 0;
    context.lastmean = 0;
    context.baseline = demand;
    context.lastdemand = demand;
    return null;
}</pre>
<p dir="ltr">
The return instruction in a function returns the program to the point where the function was called. This means that a &quot;return msg&quot; instruction within a function does not output that msg from the Node-RED function node. To produce that output, another return command has to follow the function call. Thus the two early bailout checks are:</p>
<pre dir="ltr">
if (context.counter == 3 &amp;&amp; context.counterplus &gt;= 1) {
    if (demand &lt; context.baseline) {
        bailout();
        return null;
    }
}</pre>
<p dir="ltr">
And:</p>
<pre dir="ltr">
if (context.counter &gt;= 4 ) {
    if (context.counterplus &gt;= 2 &amp;&amp; context.counterminus &gt;= 2) {
        bailout();
        return null;
    }
}</pre>
<p dir="ltr">
Having created the bailout function, it can also be called when the demand metric is below the threshold set for monitoring.</p>
<p dir="ltr">
One more alarm type needs to be addressed - a spike alarm for a sudden step change in a monitored metric. And then finally the complete monitoring package needs to be unified under WebGUI. These will be the topics of the final blogs in this series.</p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr"><strong>Getting more out of OMNIbus - creating a trend event</strong> (Wim Harthoorn, 8 February 2014)</p>
<p dir="ltr">
Here I am going to continue with the theme of monitoring the UK National Grid using Netcool OMNIbus and Node-RED - see my earlier blogs in &nbsp;<a href="https://www.ibm.com/developerworks/mydeveloperworks/blogs/More_from_OMNIbus/?lang=en">&quot;Getting&nbsp;more&nbsp;out&nbsp;of&nbsp;Netcool&nbsp;OMNIbus&quot;</a> - and look at creating trend warning events.</p>
<p dir="ltr">
Threshold events are a useful way of detecting problems from metrics, but they are really only applicable where a clear limit can be defined, for example alarming when a current transformer reports 90 amps on a circuit where a breaker will cut in at 100 amps. Often though we are less interested in crossing a threshold than we are in knowing a potential runaway situation has arisen. In our National Grid example we want to know when demand is rising unusually fast.</p>
<p dir="ltr">
First of all though we need to understand the environment we are monitoring. In a previous blog I mentioned that I was routing every piece of collected data to a log file. Now it&#39;s time to run that file through Excel so that we can plot a typical day on the UK&#39;s national grid. The result is what one would expect: a rise in demand in the morning, and a further rise in late afternoon as domestic demand picks up when people go home while industrial and commercial demand has not yet started to tail off:</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/NG-day.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/NG-day.png" style="width: 600px; display: block; margin: 0px auto; text-align: center;" /></a></p>
<p dir="ltr">
So for this exercise we want to be able to pick up if demand increases faster than a typical early morning spurt, or if the late afternoon-early evening climb continues for longer than usual. And, always important in event management, we also need to be able to clear any alarms raised when the condition causing them has gone away. This is trending for events and operations, which is a much more short-term affair than the sort of trend analysis done over a longer period for capacity planning purposes.</p>
<p dir="ltr">
Specifically I propose three conditions to be looked out for over a thirty minute period (six five minute collections):</p>
<ul dir="ltr">
<li>
A rise in demand greater than 5%<br />
&nbsp;</li>
<li>
A steady rise in demand where five out of the six monitoring periods are an increase on the one before<br />
&nbsp;</li>
<li>
A fall in demand where the mean of the last three periods is less than the mean of the first three periods<br />
&nbsp;</li>
</ul>
<p dir="ltr">
The last condition is to be used to clear alarms and will require a modification to the Generic Clear automation, of which more anon. The other two alarms are also prioritised: if the 5% increase occurs then the steady rise is assumed and will not be separately alarmed. This means that there will never be more than one alarm per metric. The mean of three values is also calculated and used instead of a single value, to try to obviate false positives generated by a single rogue metric.</p>
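<p dir="ltr">
The mean-of-three smoothing can be shown in isolation (a sketch of the idea in the third bullet above; the real flow accumulates these values in the context object one sample at a time rather than receiving all six at once):</p>

```javascript
// Sketch: compare the mean of the first three samples in the thirty-minute
// window with the mean of the last three, as used for the rise/fall decisions.
function windowMeans(samples) {
    if (samples.length !== 6) {
        throw new Error("expected six five-minute samples");
    }
    function sum(a) { return a.reduce(function (x, y) { return x + y; }, 0); }
    return {
        baselineMean: sum(samples.slice(0, 3)) / 3,  // first half of the window
        lastMean: sum(samples.slice(3)) / 3          // second half of the window
    };
}
```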
<p dir="ltr">
Our Node-RED flow then is the same as in my previous blog except that the analysis function is not looking for a threshold that has been breached but checking collected values against previous ones. Node-RED has a defined object - context{} - which can be used to store values from one execution of the script to the next. We will use this to hold the first value collected, the means calculated and the most recent value collected as well as counters to check progress.</p>
<p dir="ltr">
The first step in our function script is the parsing of the HTTP data as before, but the next step is to initialise the counters if it&#39;s the first execution of the script. We can test for first execution by checking if context.counter is defined or not.</p>
<pre dir="ltr">
if (typeof context.counter == &#39;undefined&#39;) {
    context.counter = 1;
    context.countermax = demand;
    context.countermin = demand;
    context.counterplus = 0;
    context.counterminus = 0;
    context.lastmean = 0;
    context.baseline = demand;
    context.lastdemand = demand;
}</pre>
<p dir="ltr">
From there the next step is to evaluate the newly collected value against the ones held by the context object and record whether it is a rise or fall.</p>
<pre dir="ltr">
context.counterplus = demand &gt; context.lastdemand ? ++context.counterplus : context.counterplus ;
context.counterminus = context.lastdemand &gt; demand ? ++context.counterminus : context.counterminus ;
context.countermax = demand &gt; context.countermax ? demand : context.countermax ;
context.countermin = demand &lt; context.countermin ? demand : context.countermin ;</pre>
<p dir="ltr">
The actual decision making on whether an alarm has to be generated is only carried out when six collected values have been assessed. Each of the three possible alarm conditions is tested, though the order in which they are examined means that if the first test is passed the others aren&#39;t examined. This meets our prioritisation requirement.</p>
<pre dir="ltr">
if (context.counter == 6 ) {
    if (context.lastmean &gt; (context.baseline * 1.0499)) {
        FreeformText = "Warning: Demand has risen by more than 5% in the last thirty minutes" ;
        MessageType = "Trend Warning" ;
        ProblemType = 5;
        MonitoredValue = "demand";
        Severity = 4;
    }
    else if (context.counterplus &gt;= 5) {
        FreeformText = "Warning: Demand has been rising for the last thirty minutes" ;
        MessageType = "Trend Warning" ;
        ProblemType = 5;
        MonitoredValue = "demand";
        Severity = 2;
    }
    else if (context.counterminus &gt;= 4 || context.lastmean &lt; context.baseline) {
        FreeformText = "Information: Demand has been falling for the last thirty minutes" ;
        MessageType = "Trend Warning" ;
        ProblemType = 6;
        MonitoredValue = "demand";
        Severity = 1;
    }
    else {
        msg.payload = "no alarm";
    }
}</pre>
<p dir="ltr">
The alarm can then be created from these values and sent to OMNIbus via the TCP socket probe as before.</p>
<p dir="ltr">
If we use the rules file from last time we will notice that de-duplication acts on these alarms. The result is that if a trend continues for a second thirty-minute period, the existing alarm is updated with a new Last Occurrence time and the Tally count is incremented. The two problem alarm types can also overwrite each other. This may not be the behaviour desired, so a line or two in the rules file defining different Identifiers for different instances - adding the timestamp to the Identifier - will resolve that.</p>
<pre dir="ltr">
if( match( @AlertGroup, "Trend Warning" ) ) {
        @Identifier = @Identifier + " " + $timestamp
}</pre>
<p dir="ltr">
Whether we want to do this depends on what strategy we decide to adopt for clearing the events. Initially I propose to clear a rising trend alarm when demand falls back below the reported level. However, if multiple rising trend alarms are deduplicated, the reported level will be updated and the earlier alarms will be prematurely cleared. If the alarms are not deduplicated then each individual rising alarm will be cleared in turn. This, in my opinion, provides a much more useful display.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm4.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm4.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr">
We could use the Generic Clear automation to clear the alarms we are creating, except that Generic Clear would need modifying to take account of the different demand levels. Modifications to Generic Clear are dangerous as they could have unintended effects elsewhere so I prefer to make a copy and modify that. That does mean a change needs to be made to the filter statement so that the copy only works on the alarms we want it to. The astute will have noticed that in the code printed above I have used Types 5 and 6 instead of 1 and 2 for the problem and resolution indications. We also need to create a problem_trend_events table in the alerts database by copying the problem_events table and then adding a Demand field to it.</p>
<p dir="ltr">
The copied Generic Clear automation can then be modified:</p>
<pre dir="ltr">
begin
    -- Populate a table with Type 5 events corresponding to any uncleared Type 6 events
    for each row problem in alerts.status where
            problem.Type = 5 and problem.Severity &gt; 0 and
            (problem.Node + problem.AlertKey + problem.AlertGroup + problem.Manager) in
            ( select Node + AlertKey + AlertGroup + Manager from alerts.status where Severity &gt; 0 and Type = 6 )
    begin
        insert into alerts.problem_trend_events values ( problem.Identifier, problem.LastOccurrence,
                problem.AlertKey, problem.AlertGroup,
                problem.Node, problem.Manager, false, problem.PowerDemand );
    end;

    -- For each resolution event, mark the corresponding problem_trend_events entry as resolved
    -- and clear the resolution
    for each row resolution in alerts.status where resolution.Type = 6 and resolution.Severity &gt; 0
    begin
        set resolution.Severity = 0;
        update alerts.problem_trend_events set Resolved = true where
                LastOccurrence &lt; resolution.LastOccurrence and Demand &gt; resolution.PowerDemand and
                Manager = resolution.Manager and Node = resolution.Node and
                AlertKey = resolution.AlertKey and AlertGroup = resolution.AlertGroup ;
    end;

    -- Clear the resolved events
    for each row problem in alerts.problem_trend_events where problem.Resolved = true
    begin
        update alerts.status via problem.Identifier set Severity = 0;
    end;

    -- Remove all entries from the problems table
    delete from alerts.problem_trend_events;
end</pre>
<p dir="ltr">
The result is that rising trend alarms are cleared when a falling trend alarm goes below the value when they were triggered. This gives a visualisation of the trend situation as things subside back to normal.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm5.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm5.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
This setup now works. However there are some improvements that could be made to make it more robust and efficient, and I&#39;ll cover these in the next blog.</p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr"><strong>Getting more out of OMNIbus - giving a threshold event a metric of severity</strong> (Wim Harthoorn, 31 January 2014)</p>
<p dir="ltr">
In my previous blog (<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_creating_a_threshold_event_with_hysteresis?lang=en">https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_creating_a_threshold_event_with_hysteresis?lang=en</a>) I showed how using Node-RED with a TCP Socket probe meant we could have a threshold event that would not clear until three consecutive monitoring periods had been within the threshold. Quite often another threshold breach would occur before the third clear period and therefore the alarm would not clear. Additionally, if the temporal period of the DeleteClears automation is greater than the threshold monitoring period, the threshold may be breached again while the previously cleared event is still in the system and then this event would be updated. In short, it would be useful if we could indicate whether a threshold breach had been present for the entire life of the event or only for part of it - and if the latter how great a part.</p>
<p dir="ltr">
In cases like the threshold events created by the Node-RED flow I described last time where we know that each update occurs every five minutes we can calculate how long the threshold was breached, namely by multiplying @Tally by 300 seconds. Since we know how long the alarm has been active from the @FirstOccurrence and @LastOccurrence fields we can calculate a percentage, and if we put that in another field then we have the KPI we can display.</p>
<p dir="ltr">
We need an automation. The basic automation is quite simple:</p>
<pre dir="ltr">
declare
    proportionInAlarm integer;
    monitorPeriod integer;
begin
    -- find the events that the KPI applies to
    for each row problem in alerts.status where problem.AlertGroup = &#39;Threshold Breach&#39; and problem.Tally &gt; 1
    begin
        set monitorPeriod = 300;   -- sets the default monitoring period in seconds
        set proportionInAlarm = problem.Tally * monitorPeriod * 100 / (monitorPeriod + problem.LastOccurrence - problem.FirstOccurrence);
        update alerts.status via problem.Identifier set ThresholdEventKPI = proportionInAlarm;
    end;
end</pre>
<p dir="ltr">
There is one little gotcha. If you think the life of the event is the difference between LastOccurrence and FirstOccurrence you will get strange results, percentages of 198% for example. That is because the first occurrence of the event comes at the <strong>end </strong>of a monitoring period so we need to add the monitorPeriod value (300 seconds) to the life of the OMNIbus event to get the actual total monitoring period.</p>
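<p dir="ltr">
A quick worked example (with invented numbers) shows the effect of the correction:</p>

```javascript
// KPI with and without adding one monitorPeriod to the event's life.
// Suppose the event was seen 4 times at five-minute (300 second) intervals.
var monitorPeriod = 300;
var tally = 4;
var firstOccurrence = 1000;                  // arbitrary epoch seconds
var lastOccurrence = firstOccurrence + 900;  // three intervals after the first

// Naive life = LastOccurrence - FirstOccurrence: yields more than 100%
var naive = tally * monitorPeriod * 100 / (lastOccurrence - firstOccurrence);

// Corrected life includes the monitoring period that produced the first event
var corrected = tally * monitorPeriod * 100 /
                (monitorPeriod + lastOccurrence - firstOccurrence);
```

Here naive comes out above 100%, an impossible figure, while corrected reports 100%, i.e. the threshold was breached for the event's whole monitored life.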
<p dir="ltr">
Note also that I created a new field, @ThresholdEventKPI, to hold the result. Obviously this field needs to be created before the automation or OMNIbus will report an SQL error.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm3.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm3.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
This approach will unfortunately not work in cases where repeat alarms come in randomly, but wherever the source is a timed or polled collection of data this automation, with a few tweaks, should give some extra information.</p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr"><strong>Getting more out of OMNIbus - creating a threshold event with hysteresis</strong> (Wim Harthoorn, 30 January 2014)</p>
<p dir="ltr">
In my last blog (https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/entry/getting_more_out_of_omnibus_a_new_approach_to_event_creation_more_suited_to_smarter_infrastructure_monitoring?lang=en) I set up Node-RED, an IBM Hursley development, to work with OMNIbus taking a feed from the UK&#39;s National Grid. That blog just set up a TCP socket probe to receive the Node-RED output. In this blog I will take that further and set up a threshold event that will be triggered if the frequency on the grid drops below the nominal 50 Hertz.<br />
<br />
Setting up a threshold event can be done in a probe rules file, but what I want to do here is add some hysteresis. The alarm will be triggered by the grid frequency dropping below 50Hz, but a clear event won&#39;t be sent until there have been three successive measures of frequency above 50Hz. The idea is that the alarm condition has to stay clear for a period before the alarm itself can be cleared.<br />
<br />
There are alternative methods of introducing hysteresis. One possibility is to set the clear threshold higher than the trigger threshold, for example trigger an event when frequency drops below 50Hz but don&#39;t clear the event until the frequency reaches 50.1Hz. This is how hysteresis is introduced in TNPM. However that is not the best way when a tight threshold is required, as in this case.<br />
<br />
The first step is to set up the alarm trigger in Node-RED</p>
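<p dir="ltr">Before that, the clear-threshold alternative mentioned a moment ago is easy to sketch. This is illustrative JavaScript only, not part of the actual flow; the function name and the 50/50.1 levels are mine:</p>

```javascript
// Sketch of the clear-threshold alternative: raise below one level, clear
// only above a higher one. makeHysteresisCheck and the 50/50.1 values are
// illustrative, not part of the actual flow.
function makeHysteresisCheck(triggerAt, clearAt) {
    var inAlarm = false;
    return function (value) {
        if (!inAlarm && value < triggerAt) {
            inAlarm = true;
            return "raise";      // value dropped below the trigger level
        }
        if (inAlarm && value >= clearAt) {
            inAlarm = false;
            return "clear";      // only clears once value recovers past clearAt
        }
        return null;             // no state change
    };
}

var check = makeHysteresisCheck(50, 50.1);
console.log(check(49.97));  // "raise"
console.log(check(50.05));  // null - above the trigger but below the clear level
console.log(check(50.12));  // "clear"
```

<p dir="ltr">The gap between the two levels is what suppresses raise/clear flapping; the drawback, as noted, is that a tight threshold leaves no room for such a gap.</p>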
<p dir="ltr">
From the last blog we have a demonstration flow. A couple of modifications need to be made:</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/Node-RED_flow.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/Node-RED_flow.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
The output that sent every update to OMNIbus has now been redirected to a log file. It might be of interest to graph the measures of demand and frequency over time but that is not the task of OMNIbus. What we want in OMNIbus is events sent when something is wrong, in this case when the frequency drops below 50Hz. So we are using a second output to drive the probe. (For debugging I also set up a third port with just a debug node attached)</p>
<p dir="ltr">
To recap from my last blog, this flow is triggered every five minutes and an HTTP GET is made to the National Grid website. The National Grid returns a summary of its status, which a browser renders as:</p>
<p class="small" dir="ltr">
<span style="font-family:courier new,courier,monospace;">Demand: 48173MW<br />
09:45:00 GMT<br />
Frequency: 50.029Hz<br />
09:49:45 GMT</span></p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">System Transfers</span></p>
<p class="small" dir="ltr">
<span style="font-family:courier new,courier,monospace;">N.Ireland to Great Britain: -230MW<br />
France to Great Britain: 1992MW<br />
Netherlands to GB: 894MW<br />
30/01/2014 09:30:00 GMT<br />
<br />
North-South: 6385MW<br />
Scot - Eng: 1215MW<br />
30/01/2014 09:52:00 GMT</span></p>
<p dir="ltr">
This output is then fed to a function node (Parse NG output) containing a JavaScript function that parses the data into tokens; these are then sent either to a log file or, via a TCP socket, to an OMNIbus socket probe and thus into OMNIbus.</p>
<p dir="ltr">
However the JavaScript in the function node can also do comparisons and manipulations, and this is what I want to examine first.</p>
<p dir="ltr">
Parsing the data from the National Grid uses the JavaScript split and parse functions to extract the demand and frequency values:</p>
<p dir="ltr">
<em>var words = msg.payload.split(&quot;div&quot;)[1].split(&quot;&lt;BR&quot;);<br />
if (words.length &gt;= 3) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; demand = parseInt(words[0].split(&quot;:&quot;)[1]);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frequency = parseFloat(words[2].split(&quot;:&quot;)[1]);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; timestamp = words[1].split(&quot;&gt;&quot;)[1];</em><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</p>
<p dir="ltr">
Note that frequency needs &quot;parseFloat&quot; as it is not an integer value.</p>
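<p dir="ltr">A quick illustration of the difference (the token value below is made up, but mimics the &quot;Frequency: 50.029Hz&quot; field being parsed):</p>

```javascript
// parseInt stops reading at the decimal point and drops the fraction;
// parseFloat keeps it. The token mimics the "Frequency: 50.029Hz" field.
var token = " 50.029Hz";
console.log(parseInt(token, 10));   // 50 - fraction silently lost
console.log(parseFloat(token));     // 50.029
```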
<p dir="ltr">
Now that we have some variables to work with we can create a threshold alarm. As the AC frequency of the grid decreases under stress we can use it as a warning indicator. Nominally the grid frequency is 50Hz, so let&#39;s trigger an alarm when it drops below 50Hz. A simple &quot;if&quot; branch will do that, and then all we need to do is craft the message to send to the socket probe.</p>
<p dir="ltr">
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (frequency &lt; 50 ) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MessageType = &quot;Threshold Breach&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ThresholdType = &quot;simple&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ProblemType = 1;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Severity = 3;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; FreeformText = &quot;Warning: Frequency below 50Hz&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg2.payload = &quot;National Grid|&quot; + demand + &quot;|&quot; + frequency + &quot;|&quot; + timestamp + &quot;|&quot; + MessageType + &quot;|&quot; + ThresholdType + &quot;|&quot; + ProblemType + &quot;|&quot; + Severity + &quot;|&quot; + FreeformText + &quot;\n\n&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return [msg2];<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</em></p>
<p dir="ltr">
When the grid frequency drops below 50Hz a single line message delimited by pipe characters and terminated by two newline characters is sent to the socket probe, which, assuming the props file and rules file are configured to do so, will create an alarm in OMNIbus.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm1.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm1.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
Now we could create a clear alarm in the same way when the frequency goes above 50Hz again, but that would be too simplistic. We want something a bit more sophisticated, namely:</p>
<ul dir="ltr">
<li>
only check for frequency being above 50Hz when it is in an alarm state from a previous below 50Hz state<br />
&nbsp;</li>
<li>
do not send a clear event until three consecutive above 50Hz reports have been received</li>
</ul>
<p dir="ltr">
To do this we need to hold values from previous inputs. Node-RED function nodes provide a predefined object called context that persists between invocations and can be used to hold these values. So the first step is to add a couple of lines to the alarm raise code to set a context flag recording that an alarm has been raised, and a counter:</p>
<p dir="ltr">
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; context.falarm = 1;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; context.fcount = 3;</em></p>
<p dir="ltr">
Now we can use these to test whether an above 50Hz metric should trigger a clear event. Don&#39;t forget that for the Generic Clear automation to work the ProblemType needs to be set to 2 and the Severity needs to be set to 1.</p>
<p dir="ltr">
<em>else {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MessageType = &quot;Threshold Breach&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ThresholdType = &quot;simple&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; FreeformText = &quot;Clear: Frequency now above 50Hz&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // only create clears if the alarm is actually set<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (context.falarm == 1 &amp;&amp; frequency &gt; 50) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; context.fcount = context.fcount - 1;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ProblemType = 2;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Severity = 1;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (context.fcount == 0) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; context.falarm = 0;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg2.payload = &quot;National Grid|&quot; + demand + &quot;|&quot; + frequency + &quot;|&quot; + timestamp + &quot;|&quot; + MessageType + &quot;|&quot; + ThresholdType + &quot;|&quot; + ProblemType + &quot;|&quot; + Severity + &quot;|&quot; + FreeformText + &quot;\n\n&quot;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return [msg2];</em><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</p>
<p dir="ltr">
The result is that after the third above-50Hz message is received while the flow is in an alarm state, a clear event is sent and OMNIbus&#39; Generic Clear automation uses it to clear both events to a green state.</p>
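<p dir="ltr">The raise/clear behaviour can be pictured with this stripped-down sketch, in which a plain object stands in for Node-RED&#39;s per-node context; the readings and the onReading name are illustrative:</p>

```javascript
// Stripped-down sketch of the raise/clear logic, with a plain object standing
// in for Node-RED's per-node context. Readings and names are illustrative.
var context = { falarm: 0, fcount: 0 };

function onReading(frequency) {
    if (frequency < 50) {
        context.falarm = 1;
        context.fcount = 3;         // any breach re-arms the counter
        return "raise";
    }
    if (context.falarm === 1) {
        context.fcount -= 1;
        if (context.fcount < 1) {
            context.falarm = 0;
            return "clear";         // third consecutive good reading
        }
    }
    return null;                    // good reading, still counting down
}

var readings = [49.9, 50.02, 50.01, 49.95, 50.03, 50.02, 50.04];
console.log(readings.map(onReading));
// -> ["raise", null, null, "raise", null, null, "clear"]
```

<p dir="ltr">Note how the fourth reading, a fresh breach after only two good readings, re-arms the countdown: exactly the behaviour described above.</p>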
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm2.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/GridAlarm2.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
In practice you will often find that one or two above-50Hz measures come in but the third metric is back below 50Hz, so the clear event is not sent; a raise event, however, is. Deduplication takes care of this within OMNIbus, and the tally count suggests a way to create a further KPI to show how serious the problem is. That, however, is something for the next instalment.</p>
<p dir="ltr">
<strong>Reference Material</strong></p>
<p dir="ltr">
<em><strong>Complete Function Node Code:</strong></em></p>
<p dir="ltr">
<em>&nbsp;&nbsp; // does a simple text extract parse of the http output to provide an<br />
&nbsp;&nbsp;&nbsp; // object containing the uk power demand, frequency and time<br />
&nbsp;&nbsp;&nbsp; // context {};<br />
<br />
&nbsp;&nbsp;&nbsp; if (~msg.payload.indexOf(&#39;&lt;BR&#39;)) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; var words = msg.payload.split(&quot;div&quot;)[1].split(&quot;&lt;BR&quot;);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (words.length &gt;= 3) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; demand = parseInt(words[0].split(&quot;:&quot;)[1]);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frequency = parseFloat(words[2].split(&quot;:&quot;)[1]);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; timestamp = words[1].split(&quot;&gt;&quot;)[1];<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // log message<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg = {};<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg2 = {};<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;msg3 = {};<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg.payload = demand + &quot;,&quot; + frequency + &quot;,&quot; + timestamp;</em><br />
<br />
<em>&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;// Test frequency against simple threshold.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (frequency &lt; 50 ) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MessageType = &quot;Threshold Breach&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ThresholdType = &quot;simple&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ProblemType = 1;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Severity = 3;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; FreeformText = &quot;Warning: Frequency below 50Hz&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; context.falarm = 1;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; context.fcount = 3;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg2.payload = &quot;National Grid|&quot; + demand + &quot;|&quot; + frequency + &quot;|&quot; + timestamp + &quot;|&quot; + MessageType + &quot;|&quot; + ThresholdType + &quot;|&quot; + ProblemType + &quot;|&quot; + Severity + &quot;|&quot; + FreeformText + &quot;\n\n&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg3.payload = context.falarm + &quot;,&quot; + context.fcount<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return [msg,msg2,msg3];<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; MessageType = &quot;Threshold Breach&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ThresholdType = &quot;simple&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; FreeformText = &quot;Clear: Frequency now above 50Hz&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // only create clears if the alarm is actually set<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (context.falarm == 1 &amp;&amp; frequency &gt; 50) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; context.fcount = context.fcount - 1;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ProblemType = 2;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Severity = 1;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (context.fcount &lt; 1) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; context.falarm = 0;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg2.payload = &quot;National Grid|&quot; + demand + &quot;|&quot; + frequency + &quot;|&quot; + timestamp + &quot;|&quot; + MessageType + &quot;|&quot; + ThresholdType + &quot;|&quot; + ProblemType + &quot;|&quot; + Severity + &quot;|&quot; + FreeformText + &quot;\n\n&quot;;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg3.payload = context.falarm + &quot;,&quot; + context.fcount<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return [msg,msg2,msg3];<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg2.payload = null<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg3.payload = context.falarm + &quot;,&quot; + context.fcount + &quot;,&quot; + frequency<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return [msg,msg2,msg3]<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // return [msg,msg2];<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
&nbsp;&nbsp;&nbsp; }<br />
&nbsp;&nbsp;&nbsp; return null;</em></p>
<p dir="ltr">
<strong><em>Socket probe properties:</em></strong></p>
<p dir="ltr">
Delimiter&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; :&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &quot;|&quot;<br />
SingleLines&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; :&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br />
StreamCapture&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; :&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0<br />
MessageLevel&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; :&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &quot;warn&quot;<br />
MessageLog&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; :&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &quot;$OMNIHOME/log/socket.log&quot;<br />
&nbsp;</p>
<p dir="ltr">
<strong><em>Socket probe rules:</em></strong></p>
<p dir="ltr">
########################################################################<br />
#<br />
#&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Licensed Materials - Property of IBM<br />
#&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &quot;Restricted Materials of IBM&quot;<br />
#&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;<br />
#&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5724-S44<br />
#&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;<br />
#&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (C) Copyright IBM Corp. 2004, 2006, 2010.<br />
#&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;<br />
#&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Netcool/OMNIbus Probe for Socket<br />
#&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;<br />
#&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;<br />
#######################################################################<br />
<br />
#######################################################################<br />
# The following are the elements generated by this TCP/IP Socket Probe.<br />
#<br />
# $FQDN&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;: Fully Qualified Domain Name of the client.<br />
# $Hostname&nbsp;&nbsp; &nbsp;: IP address of the client.<br />
#<br />
# When used with the Node-RED National Grid Demo<br />
#<br />
# $Token000&nbsp;&nbsp; &nbsp;: Node-RED identifier<br />
# $Token001&nbsp;&nbsp; &nbsp;: Demand<br />
# $Token002&nbsp;&nbsp; &nbsp;: Frequency<br />
# $Token003&nbsp;&nbsp; &nbsp;: NG timestamp<br />
# $Token004&nbsp;&nbsp; &nbsp;: Message Type<br />
# $Token005&nbsp;&nbsp; &nbsp;: Threshold Type (none, simple, spike, trend)<br />
# $Token006&nbsp;&nbsp; &nbsp;: Problem Type (0,1,2)<br />
# $Token007&nbsp;&nbsp; &nbsp;: Severity<br />
# $Token008&nbsp;&nbsp; &nbsp;: Freeform Text<br />
#<br />
#######################################################################<br />
<br />
if( match( @Manager, &quot;ProbeWatch&quot; ) )<br />
{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; switch(@Summary)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; case &quot;Running ...&quot;:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Severity = 1<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @AlertGroup = &quot;probestat&quot;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Type = 2<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; case &quot;Going Down ...&quot;:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Severity = 5<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @AlertGroup = &quot;probestat&quot;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Type = 1<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; default:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Severity = 1<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @AlertKey = @Agent<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Summary = @Agent + &quot; probe on &quot; + @Node + &quot;: &quot; + @Summary<br />
}<br />
else<br />
{<br />
&nbsp;&nbsp; &nbsp;@Manager = %Manager<br />
<br />
&nbsp;&nbsp; &nbsp;# The entity that the alarm refers to<br />
&nbsp;&nbsp; &nbsp;@Node = $Token000<br />
<br />
&nbsp;&nbsp; &nbsp;# This should be the logical address of the entity, eg host:port.<br />
&nbsp;&nbsp; &nbsp;@NodeAlias = $Hostname<br />
<br />
&nbsp;&nbsp; &nbsp;# Should include name of vendor and system name.<br />
&nbsp;&nbsp; &nbsp;@Agent = &quot;Node-RED&quot;<br />
&nbsp;<br />
&nbsp;&nbsp; &nbsp;# Used to determine which set of tools are available when you right click on this event.<br />
&nbsp;&nbsp; &nbsp;@Class = 1150<br />
<br />
&nbsp;&nbsp;&nbsp; # This is the descriptive name of the type of the problem eg &quot;power status&quot;, &quot;link status&quot; etc.<br />
&nbsp;&nbsp; &nbsp;@AlertGroup = $Token004<br />
&nbsp;<br />
&nbsp;&nbsp; &nbsp;# This is the &#39;logical&#39; name of the managed object instance<br />
&nbsp;&nbsp; &nbsp;@AlertKey = $Token005<br />
<br />
&nbsp;&nbsp; &nbsp;# Map the data source severity directly, if it exists.<br />
&nbsp;&nbsp; &nbsp;# (Note: resolution events should be set to severity 1, the generic clear will set them to 0 later)<br />
&nbsp;&nbsp; &nbsp;if (exists ($Token007)) {<br />
&nbsp;&nbsp; &nbsp;@Severity = $Token007<br />
&nbsp;&nbsp; &nbsp;}<br />
&nbsp;&nbsp; &nbsp;else {<br />
&nbsp;&nbsp; &nbsp;@Severity = 1<br />
&nbsp;&nbsp; &nbsp;}<br />
<br />
&nbsp;&nbsp; &nbsp;# Set to 1 for a problem, 2 for a resolution<br />
&nbsp;&nbsp; &nbsp;if (exists ($Token006)) {<br />
&nbsp;&nbsp; &nbsp;@Type = $Token006<br />
&nbsp;&nbsp; &nbsp;}<br />
&nbsp;&nbsp; &nbsp;else {<br />
&nbsp;&nbsp; @Type = &quot;&quot;<br />
&nbsp;&nbsp;&nbsp; &nbsp;}<br />
&nbsp;&nbsp; &nbsp;@Identifier = @Node + &quot; &quot; + @AlertKey + &quot; &quot; + @AlertGroup + &quot; &quot; + @Type + &quot; &quot; + @Agent + &quot; &quot; + @Manager<br />
<br />
&nbsp;&nbsp; &nbsp;# Use Summary from the input if it exists, otherwise define Summary here<br />
&nbsp;&nbsp; &nbsp;if (exists ($Token008)) {<br />
&nbsp;&nbsp; &nbsp;@Summary = $Token008<br />
&nbsp;&nbsp; &nbsp;}<br />
&nbsp;&nbsp; &nbsp;else {<br />
&nbsp;&nbsp; &nbsp;switch ($Token004) {<br />
&nbsp;&nbsp; &nbsp;case &quot;Grid status event&quot;:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;National Grid Status at &quot; + $Token003 + &quot;: Demand -&quot; + $Token001 + &quot; MW : Frequency - &quot; + $Token002 + &quot; Hz&quot;<br />
<br />
&nbsp;&nbsp; &nbsp;default:<br />
&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;@Summary = &quot;Undefined event type&quot;<br />
&nbsp;&nbsp; &nbsp;}<br />
&nbsp;&nbsp; &nbsp;}<br />
<br />
&nbsp;&nbsp; &nbsp;details ($*)<br />
}<br />
<br />
<br />
<br />
&nbsp;</p>
Getting more out of OMNIbus - a new approach to event creation more suited to smarter infrastructure monitoring (Wim Harthoorn, 2014-01-21)
<p dir="ltr">
It may be hard to believe, but at one time IT devices didn&#39;t send alarms; they just scrolled a log file such as a syslog, and operators were expected to recognise there was a problem from the messages scrolling up the screen. Among the earliest OMNIbus probes were the syslog and logfile probes, which replaced the operators&#39; eyeballs with code to recognise problems. From the mid 1980s, though, IT devices started to be programmed to detect problems and to alert operators, initially through flashing LEDs and then through messages to attached text-only terminals. Or printers: I recall my network support team being alerted to a problem on our dial-in modem racks by the sound of the dot matrix printer springing to life.<br />
<br />
In the intervening years we became accustomed to IT devices alerting us to problems through alarm messages following standards such as SNMP or X.733. The truth is though that engineering for alarms was constrained by budget and release schedules, but the alarm catalogue was generally rich enough for that not to be a major problem. It was however common to deploy performance monitoring - for example the Quallaby product that is now TNPM - to fill some of the gaps.<br />
<br />
Today however we are encountering the same problems with limited alarm provisioning as we start to support smarter infrastructure. The sensors and smart meters are mostly set up to send regular metrics and not so much to send alarms. A smart meter will certainly alarm if primary power is cut, but if it reports that power utilisation is 3.7 kilowatts, is that good or bad?<br />
<br />
How then do we turn regular metrics into alarms that can be used to warn of anomalous situations and trigger actions to investigate them? With the number of smarter infrastructure monitors being rolled out the data to be analysed will grow huge so it will have real value to be able to detect these anomalies quickly. The problem is not just finding a needle in a haystack - it&#39;s knowing there is a needle there to look for. Or five or ten needles.<br />
<br />
One obvious way to create an alarm from a metric is to set a threshold. It&#39;s simple to do with a line or two in a probe rules file, but the problem is that it is static. There are many times when there is a clear maximum level like a power limit or a temperature setting that will not change, but often it is difficult to set a threshold at a level that will catch a problem without generating a number of false positives first.<br />
<br />
Static thresholds are limiting, so what we need to do is create some dynamic thresholds. These can be thresholds set against a baseline, a threshold that compares a value against the previous value, or a threshold that is a trend over an hour or so. Those are all independent thresholds. Other thresholds might require comparing a value against another two or three different metrics collected from other sensors. For example a crane motor drawing x amps might be normal when it&#39;s lifting a heavy load but not when it is merely moving the hoist around.<br />
<br />
I think we can safely say that the logic required here is beyond the capabilities of a probe rules file. What we need is a flexible tool to do some pre-processing of metrics before the probe creates the alarm. We could do what we did before and bring in a performance tool like TNPM or even ITM, but in this blog I am going to draw your attention to a piece of software being developed in IBM Hursley called Node-RED. I won&#39;t give a full description of what Node-RED is or what it does, I refer you to the Node-RED site, node-red.org, for that, but I will show the architecture I used and describe how the first integration between OMNIbus and Node-RED was achieved.</p>
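<p dir="ltr">Before moving on, to make the dynamic-threshold idea above concrete, here is one minimal sketch: a value checked against a running baseline with a tolerance. The function name, the 2% tolerance and the smoothing factors are illustrative choices of mine, not from any product:</p>

```javascript
// One way to make a threshold dynamic: compare each value against a running
// baseline with a tolerance. The 2% tolerance and the 0.9/0.1 smoothing
// factors are illustrative choices, not from any product.
function makeBaselineCheck(tolerance) {
    var baseline = null;
    return function (value) {
        if (baseline === null) {
            baseline = value;        // first sample seeds the baseline
            return false;
        }
        var breach = Math.abs(value - baseline) > baseline * tolerance;
        // fold the sample into the baseline (simple exponential smoothing)
        baseline = 0.9 * baseline + 0.1 * value;
        return breach;
    };
}

var demandCheck = makeBaselineCheck(0.02);
console.log(demandCheck(50000));  // false - seeds the baseline
console.log(demandCheck(50400));  // false - within 2% of the baseline
console.log(demandCheck(52000));  // true  - more than 2% above it
```

<p dir="ltr">Because the baseline follows the data, this kind of check adapts to gradual drift while still flagging sudden jumps, which is exactly what a static threshold cannot do.</p>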
<p dir="ltr">
The architecture deployed is as in this diagram</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/OMNI-NR.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/OMNI-NR.png" style="width: 75%; display: block; margin: 0px auto; text-align: center; height: 75%;" /></a><br />
In this exercise we have used option 1, the socket probe. A proof of concept of option 3 has been done but that is still too immature to consider further.</p>
<p dir="ltr">
<strong>Monitoring the UK&#39;s National Grid</strong></p>
<p dir="ltr">
For this exercise we need a data source that can provide us with a regular stream of data and as the chaps in Hursley have created a demo exercise using the UK&#39;s national electricity grid, it seemed a good idea to use that. In that exercise Node-RED collects grid metrics from the URL <a href="http://www.nationalgrid.com/ngrealtime/realtime/systemdata.aspx">http://www.nationalgrid.com/ngrealtime/realtime/systemdata.aspx</a>. That URL returns:</p>
<p class="small" dir="ltr">
<span style="font-family:courier new,courier,monospace;">Demand: 51291MW<br />
17:00:00 GMT<br />
Frequency: 50.032Hz<br />
17:01:45 GMT</span></p>
<h3 dir="ltr">
<span style="font-family:courier new,courier,monospace;">System Transfers</span></h3>
<p class="small" dir="ltr">
<span style="font-family:courier new,courier,monospace;">N.Ireland to Great Britain: -252MW<br />
France to Great Britain: 1992MW<br />
Netherlands to GB: 1000MW<br />
20/01/2014 17:00:00 GMT<br />
<br />
North-South: 7260MW<br />
Scot - Eng: 1690MW</span><br />
<span style="font-family:courier new,courier,monospace;">20/01/2014 17:03:00 GMT</span></p>
<p class="small" dir="ltr">
The Node-RED demo also provides some sample code to collect this data at five minute intervals and to parse it so that demand and frequency can be isolated, so we may as well use this to get us started. Node-RED also provides nice Debug nodes so we can test what each function node actually delivers, and the output from the parsing node as suggested is:</p>
<p class="small" dir="ltr">
<span class="debug-message-payload">(Object) { &quot;demand&quot;: 47069, &quot;frequency&quot;: 50.059, &quot;time&quot;: &quot;10:45:00 GMT&quot; }</span></p>
<p class="small" dir="ltr">
<span class="debug-message-payload">Rather than muck around with that I defined an extra output from the function node, and with an extra couple of lines of javascript created an output more friendly to a Netcool socket probe:</span></p>
<p class="small" dir="ltr">
<span class="debug-message-payload"><span class="debug-message-payload">Node-RED|47078|50.016|10:50:00 GMT </span></span></p>
<p class="small" dir="ltr">
<span class="debug-message-payload"><span class="debug-message-payload">For reference the javascript of the parsing function is shown below, the original suggested by the Node-RED demo is in italics and my additions are in bold.</span></span></p>
<p class="small" dir="ltr">
&nbsp;</p>
<p class="small" dir="ltr">
<em>&nbsp;if (~msg.payload.indexOf(&#39;&lt;BR&#39;)) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; var words = msg.payload.split(&quot;div&quot;)[1].split(&quot;&lt;BR&quot;);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (words.length &gt;= 3) {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg.payload = {};<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg.payload.demand = parseInt(words[0].split(&quot;:&quot;)[1]);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg.payload.frequency = parseFloat(words[2].split(&quot;:&quot;)[1]);<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg.payload.time = words[1].split(&quot;&gt;&quot;)[1];<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // Create the true/false signal based on the frequency.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg2 = {};<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg2.payload = (msg.payload.frequency &gt;= 50) ? true : false;</em><br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;<br />
<strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // Create a tcp output suitable for a socket probe<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg3 = {};<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; msg3.payload = &quot;Node-RED|&quot; + msg.payload.demand + &quot;|&quot; + msg.payload.frequency + &quot;|&quot; + msg.payload.time + &quot;\n\n\n&quot;;</strong><br />
<br />
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return [msg,msg2</em>,<strong>msg3</strong>];<br />
<em>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />
}</em></p>
<p dir="ltr">
A TCP socket probe can now be set up with properties set to parse single lines using the pipe symbol (&quot;|&quot;) as a delimiter and two blank lines to indicate the end of a record. In Node-RED a TCP output node is connected to the parse function and configured to send data to the TCP port and IP address of the socket probe.</p>
<p dir="ltr">
The probe now receives a set of tokens which can be viewed in Alert Details if the rules file is configured for that, or in the probe log if the message level is set to debug.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/node-red-tokens.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/node-red-tokens.png" style=" display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
From here on in it is simple rules file work to get the event correctly presented in OMNIbus.</p>
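<p dir="ltr">That rules file work can be sketched as follows. This is a sketch only: the token names ($1 to $4) and the field assignments are assumptions based on the pipe-delimited payload built above, so check the tokens your probe actually reports (in Alert Details or the debug log) before using it.</p>

```
# Sketch only: assumes the probe presents the four pipe-separated fields
# ("Node-RED", demand, frequency, time) as tokens $1 to $4
if (match($1, "Node-RED")) {
        @Manager = "Socket Probe"
        @Node = $1
        @AlertGroup = "GridSupply"
        @AlertKey = "mains-frequency"
        @Summary = "Demand " + $2 + " MW, frequency " + $3 + " Hz at " + $4
        @Identifier = @Node + @AlertGroup + @AlertKey
        @Severity = 2
}
```

<p dir="ltr">With @Identifier fixed per node and alert key, deduplication will keep a single event per feed showing the latest status.</p>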
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/node-red-event.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/node-red-event.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
Naturally you will use deduplication to ensure that the event merely shows the latest status.</p>
<p dir="ltr">
So now we have the basic integration. In the next blog on this topic I will cover extending this to produce a grid monitoring visualisation using Node-RED and OMNIbus together.</p>
<p dir="ltr">
&nbsp;</p>
Getting more out of OMNIbus - Tracking those flapping alarms (Wim Harthoorn, 31 July 2013)<p dir="ltr">
One of the real selling points of Netcool OMNIbus has always been that it can cut the number of alarms presented to network operators by a very large proportion; 80% or more alarm reduction is not unusual. De-duplication, i.e. updating an existing alarm with a new last occurrence and tally rather than creating a new entry for a duplicate alarm, has been a key tool in that reduction. The trouble with de-duplication, however, is that it smothers alarm history. Suppose an operator is faced with a transient problem, for example occasional failures in accessing a service reported by ITCAM, and can see two or three alarms with high tally counts. How can that operator deduce whether the service failures are down to an intermittent problem with a router interface, or whether a stream of high memory utilisation events from a server is the cause? Being able to track which of these intermittent events occur simultaneously would be very useful.</p>
<p dir="ltr">
One way to do this would be to use archived alarms and build TCR reports, but this assumes that all alarms are being archived and that the operator has both the ability and the authority to create the TCR reports. It would also be a rather heavyweight solution to an issue that should really be a matter of selecting an alarm or three, right-clicking and choosing a &quot;track alarm&quot; tool. In this blog I am going to describe a possible way of achieving that.</p>
<p dir="ltr">
This first blog will describe how I went about proving the concept. In later blogs I will cover how to provide wrappers and reporting tools to turn the concept into a solution.</p>
<p dir="ltr">
What I didn&#39;t want to do was to change the de-duplication process in any way. The approach I wanted to take was to select a tiny subset of the alarms coming in and have them write severity changes with timestamps to a custom table. To minimise the risk of overloading Object Servers with extra automations I thought it best to push that work out to the probe. That can be done by checking an alarm&#39;s identity against a lookup table and using the genevent function, introduced in OMNIbus 7.3.1, to write to the custom table in addition to writing the alarm to the normal alerts.status table. This does mean that when a new alarm is selected for monitoring the probe needs to re-read its rules file, but that can be triggered through the new HTTP features in OMNIbus 7.4.</p>
<p dir="ltr">
The first step is to create the custom table that will hold the alarm status changes. I&#39;ve called this table statmon.status and its structure is:</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/statmonStatusTable.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/statmonStatusTable.png" style=" display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
The key fields for monitoring and graphing are ChangeTime and Severity, but I&#39;ve added other fields for manageability. Identifying Owner and Group opens the possibility of running multiple traces by different operators later on. Having LocalIdentifier as well as Identifier gets round the issue that many rules files put @Severity or @Type into the makeup of @Identifier, which we don&#39;t want in this application, and the MonitorCount field lets us use an integer instead of a string as a filter in charts.</p>
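<p dir="ltr">For reference, a table of that shape can be created with ObjectServer SQL along these lines. Treat this as a sketch: the column types and widths are assumptions from the description above, the time-stamped LocalIdentifier is taken as the primary key so that each severity change lands as a distinct row, and the Group column may need renaming if it clashes with a reserved word on your version.</p>

```
-- Sketch only: names follow the description of statmon.status above
create database statmon;
go

create table statmon.status persistent
(
    LocalIdentifier  varchar(255) primary key,
    Identifier       varchar(255),
    Severity         int,
    ChangeTime       utc,
    MonitorCount     int,
    Owner            varchar(64),
    Group            varchar(64)
);
go
```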
<p dir="ltr">
Now we have somewhere to put the results we need to modify our probe rules file to go into monitor mode and post the changes in Severity. The first part of this is to create a lookup table in a file that will identify which alarms should send their new Severity to this table. This will be a three column table (key plus two variables) looking something like this:</p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">Dummy&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; no<br />
CC_Router_1A2_s1-0LinkMonLink&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; yes<br />
LEA_Router_1A1_s1-1LinkMonLink&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; yes<br />
LEA_App_HBMachineMonSystems&nbsp;&nbsp;&nbsp;&nbsp; 3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; yes</span><br />
&nbsp;</p>
<p dir="ltr">
Later on we will look at scripts to add or remove lines from this file that can be invoked from a tool but in this proof of concept we will edit this file manually.</p>
<p dir="ltr">
In this proof of concept I have also used the simnet probe to generate events, so I added these lines to the simnet.rules. These lines are not specific to the simnet probe so could be added to other rules files as an include.</p>
<p dir="ltr">
At the top of the rules it is necessary to register the target Object Server tables as we will be sending data to multiple tables, and also define the lookup table as a file:</p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DefaultAlerts = registertarget(&quot;NCOMS&quot;, &quot;&quot;, &quot;alerts.status&quot;)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; changes = registertarget(&quot;NCOMS&quot;, &quot;&quot;, &quot;statmon.status&quot;)<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; table statmon = &quot;$OMNIHOME/probes/linux2x86/statmon.lookup&quot;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; default = { &quot;0&quot;, &quot;no&quot; }</span><br />
&nbsp;</p>
<p dir="ltr">
The lines that populate the statmon.status table are then added to the bottom of the rules file:</p>
<p dir="ltr">
&nbsp;<span style="font-family:courier new,courier,monospace;"># Lines added to operate statmon</span></p>
<p dir="ltr">
<span style="font-family:courier new,courier,monospace;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $LocalIdentifier = $Node + $Agent + $Group<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [$MonitorCount, $StatMon] = lookup ( $LocalIdentifier, statmon )<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (match($StatMon, &quot;yes&quot;)) {<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $now = getdate<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $TimedIdentifier = $LocalIdentifier + $now<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Acknowledged = 10<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; update (@Acknowledged)<br />
<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; genevent (changes,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Identifier, $LocalIdentifier,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @LocalIdentifier, $TimedIdentifier,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @Severity, $Severity,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @ChangeTime, $now,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; @MonitorCount, $MonitorCount<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; )</span><br />
<span style="font-family:courier new,courier,monospace;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</span></p>
<p dir="ltr">
As stated, the LocalIdentifier variable is necessary to strip out @Severity and/or @Type from the @Identifier field, else we would only be monitoring a subset of the alarms we want to. LocalIdentifier can then be used as the key to the lookup table. I followed standard practice in setting a variable with the system time (getdate), and I also thought it useful to be able to identify those alarms being monitored, which I did by setting @Acknowledged to 10. This had an unexpected but useful effect in that the event was given a grey colour in the Active Event List.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/statmonAEL.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/statmonAEL.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
At this stage I haven&#39;t used the Owner and Group fields, but genevent sends the key data for tracking an event to the custom table as well as sending the alarm through the normal route. Only those events whose LocalIdentifier matches a &quot;yes&quot; entry in the lookup table will do that; all other events will ignore this section of the rules.</p>
<p dir="ltr">
The result is then that the probe sends the Severity of each new alarm to our statmon.status table with a timestamp.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/statmonDataView.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/statmonDataView.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr">
Now we have this data we need to be able to visualise it. For this proof of concept I thought I&#39;d use WebGUI&#39;s charts, and boy, what fun that was. Charting in WebGUI is deprecated and the future lies with DASH and the visualisations created by the RAVE project, but that has meant that there are some weird bugs now in charting. However I was able to put some charts together to demonstrate the principle.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/statmonAlarmTrace.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/statmonAlarmTrace.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
These two traces show that the alarms we are monitoring are not related - much as you would expect with a simnet probe - but they do show the possibilities of this approach to transient alarm monitoring.</p>
<p dir="ltr">
Now we have the concept we need to move on to making it a workable solution.</p>
<p dir="ltr">
Firstly we need to have a simple way of selecting an alarm for monitoring and starting the monitoring using a tool that modifies the lookup table and gets the probe to re-read its rules file, along with another tool that stops the monitoring and tidies things up.</p>
<p dir="ltr">
We also need some automations to perform housekeeping on the statmon.status table, and possibly an automation that culls monitoring that has been forgotten about.</p>
<p dir="ltr">
And then we need much better visualisation.</p>
<p dir="ltr">
All of this I will come back to in later blogs.</p>
Getting more out of OMNIbus - reaching the man on the move (Wim Harthoorn, 14 June 2013)<p dir="ltr">
A few weeks ago I blogged about <a href="https://www.ibm.com/developerworks/community/blogs/roller-ui/authoring/weblogEntryDetail.do?entryId=2359823a-1d61-400f-821b-3b02eb8b7f3a&amp;method=view&amp;rmik=tabbedmenu.weblog.archives&amp;lnk=%2Froller-ui%2Fauthoring%2FweblogEntryManagement.do%3Fmethod%3Dquery%26rmik%3Dtabbedmenu.weblog.archives%26weblog%3DMore_from_OMNIbus%26status%3DALL%26pageSize%3D25%26reverse%3D1&amp;status=ALL&amp;lang=en" target="_blank">using&nbsp;event&nbsp;rate&nbsp;to&nbsp;alert&nbsp;to&nbsp;problems</a>, the idea being to free a technician from needing to watch an alerts view and allow them to go and do some more productive tasks when no serious issues existed. The missing component then was how to alert this technician when the event rate suddenly spiked indicating a problem had occurred that required them to return to their desk.</p>
<p dir="ltr">
A similar situation arose with a demo we have been building for a port authority. There the target audience for new alarms was a supervisor out on the dockside far away from a console in a control room.</p>
<p dir="ltr">
In both these scenarios a mechanism using the new mobile UI and OMNIbus&#39; ability to automatically send emails was used to create a practical solution.</p>
<p dir="ltr">
OMNIbus ships with an automation &quot;mail_on_critical&quot; which can be copied and modified to create a new trigger &quot;mail_on_eventrate_warning&quot;.</p>
<div dir="ltr">
<em>begin<br />
&nbsp;&nbsp;&nbsp; for each row eventrate in alerts.status where eventrate.Node = &#39;stats_eventrate&#39; and<br />
&nbsp;&nbsp;&nbsp; &nbsp; eventrate.Grade &lt; 2 and eventrate.Acknowledged = 0 and<br />
&nbsp;&nbsp;&nbsp; &nbsp; eventrate.LastOccurrence &lt;= ( getdate() - 120 )<br />
&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp; &nbsp; execute send_email( eventrate.Node, eventrate.Severity, &#39;Netcool Email&#39;, &#39;nettech@corpnet.com&#39;, eventrate.Summary, &#39;localhost&#39;);<br />
&nbsp;&nbsp;&nbsp; &nbsp; execute jinsert( eventrate.Serial, %user.user_id, getdate(), &#39;Email notification sent&#39;);<br />
&nbsp;&nbsp;&nbsp; &nbsp; update alerts.status via eventrate.Identifier set Grade=2;<br />
&nbsp;&nbsp;&nbsp; end;<br />
end</em></div>
<p dir="ltr">
You may recall that the eventrate warning trigger created a synthetic event with @Node = &#39;stats_eventrate&#39; so this can be used in the filter.</p>
<p dir="ltr">
An alternative is to use Netcool Impact to generate the email via a policy, and this is the way it was done in the ports demo.</p>
<p dir="ltr">
An eventrate warning alarm will now send out an email to the address &#39;nettech@corpnet.com&#39;. As this automation stands the email is rather basic, but in the ports demo the Impact policy resulted in a more sophisticated email:</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/email.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/email.png" style=" display:block; margin: 0 auto;text-align: center;" /></a></p>
<p dir="ltr">
Including URLs in the body of the email means that it is easy for the recipient to open the Event Dashboard set up to view these events.</p>
<p dir="ltr">
The Event Dashboard is part of WebGUI and it needs, for this application, to be enabled for mobile UI. This is one of the settings available to a WebGUI administrator under the edit page dialogue. On a mobile device the Event Dashboard is rather basic but contains all the event details.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/event_dashboard.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/event_dashboard.png" style=" width:400px; display:block; margin: 0 auto;text-align: center;" /></a></p>
<p dir="ltr">
The limitation with the present mobile UI is that it is read-only: there is no option to update the Journal or to run any Tools from it. Plans exist to enhance the UI in OMNIbus 8.1 next year to provide Tools, but the usual caveats about intentions not being commitments apply.</p>
<p dir="ltr">
However there is a workaround.</p>
<p dir="ltr">
You may have noted that the email sent out by Impact in the ports scenario had &quot;Serial = xxxx&quot; at the end of the Subject line. This creates a link to the event in OMNIbus through the unique @Serial field. Now if the recipient <strong>replies</strong> to the email, with or without putting in any text in the body, then the Impact policy will detect the email coming in, parse the Serial number of the event from the end of Subject line and use that to update the original event. In the ports scenario the update was to acknowledge the event and to mark it for Trouble Ticket creation.</p>
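<p dir="ltr">The parsing step itself is simple. In the demo it was done inside an Impact policy; the snippet below is only an illustrative stand-in in plain javascript, with a made-up subject line, showing the idea of pulling the Serial off the end of the Subject.</p>

```javascript
// Hypothetical stand-in for the Impact policy step: extract the event
// Serial from the end of a reply's Subject line.
function serialFromSubject(subject) {
  // Matches "Serial = 1234" (whitespace optional) at the end of the line
  var m = subject.match(/Serial\s*=\s*(\d+)\s*$/);
  return m ? parseInt(m[1], 10) : null;
}

var serial = serialFromSubject("RE: Crane generator fault - Serial = 4711");
// serial is now 4711, ready to drive an update of the original event
```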
<p dir="ltr">
This meant that the ports demo had a complete scenario: an issue was detected through monitoring the mechanical components of a dock crane, an event was created in OMNIbus, and a notification of that event was emailed to a supervisor on the dockside who could lift the covers on the crane&#39;s generator. If an engineer needed to be called, the supervisor could simply reply to the email; that would generate the work order, and when the work order was cleared the OMNIbus alarm would be cleared too - all visible to the dock supervisor on the same smartphone.</p>
<p dir="ltr">
Those who monitor alarms and events need no longer be desk-bound.</p>
<p dir="ltr">
&nbsp;</p>
Getting more from OMNIbus - the HTTP interface and events on maps and things (Wim Harthoorn, 14 May 2013)<p dir="ltr">
Do you like WebGUI? Well for the sort of tasks a network operator normally does WebGUI is a robust UI tool, well optimised for use with OMNIbus event management. Outside the network and systems management domain, however, the tools provided by WebGUI aren&#39;t such a good fit, and integrating into non-Tivoli portals can be a one-way affair. You can put an Active Event List into an iframe in another portal, but the chances are that the wires to enable in-context launches from that AEL aren&#39;t there, so once in the AEL the only way out is a complete bail out. That may not be ideal, but it was what the developers of the Integrated Operations Centre had to accept if they wanted to incorporate the powerful event management capabilities of Netcool OMNIbus.</p>
<p dir="ltr">
Or perhaps WebGUI is overkill for the application in question, and all that&#39;s wanted is a couple of icons in the corner of a webpage.</p>
<p dir="ltr">
Or perhaps the visualisation desired involves geo-mapping, and what&#39;s wanted are representations of events on a Googlemaps or Bing display.</p>
<p dir="ltr">
For those attempting to implement such solutions what has been missing to date is the capability to query the OMNIbus Object Server from outside the Netcool environment and get the results back in a form that can be used in HTML and javascript pages. But not any more. Since the release of OMNIbus 7.4 in November 2012, OMNIbus has had an HTTP interface through which it is possible to send HTTP GET and POST commands and get the results back in json format. Standard json parsers will extract the individual collected fields and from there anyone with javascript knowledge can use that data to create webpage representations of events. And that includes geo-mapped representations, as long as the event contains latitude and longitude information, or that information can be added as part of the query process.</p>
<p dir="ltr">
The HTTP interface also supports OSLC, a developing standard increasingly implemented within IBM for Object integration across products, so the interface is sometimes referred to as the HTTP/OSLC interface.</p>
<p dir="ltr">
The HTTP interface is not enabled by default, so the first step is to modify the Object Server properties, which will require an Object Server restart. It would also be good practice to create two new fields in the alerts.status table to hold Latitude and Longitude if it&#39;s desired to do some geo-mapping. How these fields are populated depends on the use case, but in many cases a probe rules lookup table or Impact event enrichment would suffice.</p>
<p dir="ltr">
The simplest method of extracting data from OMNIbus using HTTP is to issue a GET command. The syntax is fairly simple but needs a little study to figure out. So let&#39;s look at an example which I&#39;ve colour-coded:</p>
<p dir="ltr">
<em><span data-mce-style="color: #0000ff;" style="color:#0000ff;"><a href="http://omnihost:8080/objectserver/restapi/alerts/status">http://omnihost:8080/objectserver/restapi/alerts/status</a></span>?<span data-mce-style="color: #ff0000;" style="color:#ff0000;">filter=Latitude%20not%20like%20%270%27%20and%20Class%3D3300&amp;collist=Node%2CAlertGroup%2CLatitude%2CLongitude%2CSeverity&amp;orderby=Serial%20ASC</span></em></p>
<p dir="ltr">
The blue section is the URI for the server and the table and is fairly straightforward. A DNS call or a lookup in the hosts file should resolve &quot;omnihost&quot; to the server&#39;s IP address, and in this case the HTTP listening port property of the Object Server has been set to 8080.</p>
<p dir="ltr">
The red section is an SQL query, though it is a little hard to read because non-alphanumeric characters have been percent-encoded as hex pairs. If I replace these (%20 is the space character, %27 a single quote, %2C a comma and %3D the equals sign) then the red section reads:</p>
<p dir="ltr">
<em><em><span data-mce-style="color: #ff0000;" style="color:#ff0000;">filter=Latitude not like &#39;0&#39; and Class=3300&amp;collist=Node,AlertGroup,Latitude,Longitude,Severity&amp;orderby=Serial ASC</span></em></em></p>
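<p dir="ltr">If you want to check such a decoding yourself, the browser&#39;s own decodeURIComponent function reverses the percent-encoding:</p>

```javascript
// The filter clause from the URL above, percent-encoded and decoded
var encoded = "Latitude%20not%20like%20%270%27%20and%20Class%3D3300";
var decoded = decodeURIComponent(encoded);
// decoded === "Latitude not like '0' and Class=3300"
```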
<p dir="ltr">
<span data-mce-style="color: #ff0000;" style="color:#ff0000;"><span data-mce-style="color: #000000;" style="color:#000000;">Much more readable, and it shows that this command will collect Node, AlertGroup and Severity as well as the new Latitude and Longitude fields.</span></span></p>
<p dir="ltr">
<span data-mce-style="color: #000000;" style="color:#000000;">The page returned is a json object in two sections - a description of the fields in a &quot;rowset&quot; section and the contents under &quot;rows&quot;. If the HTTP GET was run from a browser the results are shown in that browser window, for example:</span></p>
<pre dir="ltr">
&quot;rows&quot;: [{
&quot;Node&quot;: &quot;Tokyo&quot;,
&quot;AlertGroup&quot;: &quot;Stats&quot;,
&quot;Latitude&quot;: &quot;35.7&quot;,
&quot;Longitude&quot;: &quot;139.75&quot;,
&quot;Severity&quot;: 4
}, {
&quot;Node&quot;: &quot;Beijing&quot;,
&quot;AlertGroup&quot;: &quot;Stats&quot;,
&quot;Latitude&quot;: &quot;39.9&quot;,
&quot;Longitude&quot;: &quot;116.4&quot;,
&quot;Severity&quot;: 4
}, etc etc</pre>
<p dir="ltr">
<br />
<br />
However if the GET was run from within javascript the results need parsing, which can be done with the standard JSON.parse utility.</p>
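<p dir="ltr">A minimal sketch of that parsing, using a trimmed-down response body of the shape shown above (assuming the rows array sits under a rowset object, as the excerpt suggests):</p>

```javascript
// A cut-down response body of the shape returned by the HTTP interface
var body = '{"rowset": {"rows": [{"Node": "Tokyo", "AlertGroup": "Stats", ' +
           '"Latitude": "35.7", "Longitude": "139.75", "Severity": 4}]}}';

var rows = JSON.parse(body).rowset.rows;

// Latitude/Longitude arrive as strings, so convert them before geo-mapping
var marker = {
  node: rows[0].Node,
  lat: parseFloat(rows[0].Latitude),
  lng: parseFloat(rows[0].Longitude),
  severity: rows[0].Severity
};
```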
<p dir="ltr">
The GET method only collects data; it doesn&#39;t allow for counts or calculations of averages. These are often useful, and the HTTP interface supports an SQL Factory that enables count, average, max and min to be returned as well as the data. A POST command has to be sent to the SQL Factory, and again, if that is done within javascript, JSON.parse has to be applied to release the results for use elsewhere in the script.</p>
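<p dir="ltr">As a sketch of what such a POST looks like: on a 7.4 system the SQL Factory is typically reached under /objectserver/restapi/sql/factory with the SQL statement carried in a json body, but treat the exact path and the &quot;sqlcmd&quot; key below as assumptions to verify against your own installation&#39;s documentation.</p>

```
POST /objectserver/restapi/sql/factory HTTP/1.1
Host: omnihost:8080
Content-Type: application/json
Authorization: Basic <credentials>

{ "sqlcmd": "select count(*) as CriticalCount from alerts.status where Severity = 5;" }
```

<p dir="ltr">The response is again json, ready for JSON.parse.</p>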
<p dir="ltr">
A zip file containing example scripts, including some to put events onto a map display, can be fetched from <a href="https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=cdd16df5-7bb8-4ef1-bcb9-cefb1dd40581#fullpageWidgetId=W05de62601548_4e85_8940_81bb58657a85&amp;file=41ad58ad-c15a-4d90-aabd-495b748c466f" target="_blank">here</a>. How to use the scripts in a demo or test is written up in an attached comment. Please note that these scripts are released for educational purposes.</p>
<p dir="ltr">
Have fun!</p>
Do you like WebGUI? Well for the sort of tasks a network operator normally does WebGUI is a robust UI tool well optimised for use with OMNIbus event management. However outside the network and systems management domain the tools provided by WebGUI aren&#39;t...203180urn:lsid:ibm.com:blogs:entries-9c12c00f-1498-4573-8fc7-cfd022b8c15aGetting more out of Netcool OMNIbus2016-06-16T08:31:02-04:00urn:lsid:ibm.com:blogs:entry-2359823a-1d61-400f-821b-3b02eb8b7f3aGetting more from Netcool OMNIbus - using event rates to alert to problems7YA6_Wim_Harthoorn0600027YA6activefalse7YA6_Wim_Harthoorn0600027YA6activefalseComment Entriesapplication/atom+xml;type=entryLikestrue2013-04-29T12:24:01-04:002013-04-29T12:24:01-04:00<p dir="ltr">
Some years ago I visited an Icelandic telco to pitch the Netcool solution to them. Iceland is one of the world&#39;s smaller countries - more people work for IBM than live in Iceland - so this telco was a little more modest in scope than AT&amp;T or Sprint, and the network operations centre was unmanned most of the time. The network operator - there was only one - looked in from time to time, checked the lights indicating whether the PBXs were in alarm, looked at the alarm list for the routers of the IP data network and then went back to his desk to do other things. And that makes the point that monitoring alarms is not a very productive task. Business cases for monitoring systems focus on the costs of <strong>not </strong>picking up critical alarms, which can be quite substantial, and on how the costs of monitoring can be reduced by introducing automation and greater efficiency. Part of our pitch in Iceland was in fact on reducing the need for the network operator to break off what he was doing and stroll across to the NOC to see if anything required attention.</p>
<p dir="ltr">
Many businesses are like this tier 3 telco. The cost of having one or two people employed on watching screens scroll is hard to justify, particularly as these people could be doing something more visibly productive - like fixing some of the things that flag up on their screens. A large tier 1 telco sees more than enough alarms a day to keep a team of level 1 technicians busy; smaller networks do not. How then to free the technicians from their monitoring screens and allow them to do something more productive than checking alarms and raising trouble tickets? This is a problem I had some years ago when I was running the network support team at a large financial institution. Few of the alarms that came in demanded instant response, but those that did couldn&#39;t wait for the technician to return from sorting out a Token Ring LAN issue elsewhere in the building. What I needed then was a means of hauling the tech back on the one occasion a week when a rapid response WAS required. A dozen years on we have many more possibilities for doing that.</p>
<p dir="ltr">
We could use alarm severities and OMNIbus&#39; Accelerated Event Notification, and that will work in many cases. However equipment vendors are notorious for classifying anything not merely informational as &quot;critical&quot;, but even if they do restrain themselves and reserve the highest priority for just the most serious cases, what is critical to an individual device may not be critical to the service going over it. In these days of redundant paths, virtual circuits and wireless connectivity our networks have fewer single points of failure and those that remain have only local impact. Looking for single alarms then may not be the way to go.</p>
<p dir="ltr">
However when sorrows come they come not as single spies but in battalions. (Hamlet, Act 4 Scene 5, just to show off my erudition). Serious network issues rarely manifest themselves as one or two alarms, normally we can expect a flood of symptom events to come in as well. The problem is usually that of sorting through the haystack for the needle that tells us the root cause. However we can use the existence of that flood to alert our technicians that they should head back to their desks. The first requirement then is to develop a way of accurately measuring event rates.</p>
<p dir="ltr">
OMNIbus ships with a set of triggers that measure database performance, and these maintain a running total of new inserts, deduplications, details and journal entries in the &#39;alerts&#39; database.</p>
<p dir="ltr">
<a href="http://tivoli-ug.org/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-02-61/7633.event_5F00_rate_2D00_1.png"><img alt=" " border="0" src="http://tivoli-ug.org/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-02-61/7633.event_5F00_rate_2D00_1.png" /></a></p>
<p dir="ltr">
We can use those running totals to calculate the event rate, but first we need to create a table to hold the result of our calculation. To begin with it only needs five columns, and the trigger can restrict its size to a single row. The SQL to create this table is:</p>
<p dir="ltr">
<i>create table master.event_rate virtual<br />
(<br />
DatabaseName varchar(64) primary key,<br />
EventCount int,<br />
DeltaEventCount real,<br />
LastUpdate UTC,<br />
EventRate real<br />
);<br />
go</i></p>
<p dir="ltr">
It&#39;s a virtual table, so it will be cleared when OMNIbus restarts.</p>
<p dir="ltr">
We now write a trigger that will query our activity_stats table every 60 seconds and calculate the delta of new inserts, deduplications, journal entries and details entries. Dividing that total by the elapsed time between queries gives us the alarm rate per second.</p>
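<p dir="ltr">
The arithmetic the trigger performs can be sketched outside OMNIbus. The following Python fragment is illustrative only (the real automation is ObjectServer SQL, given later in this post); the counter names simply mirror the columns read from master.activity_stats:</p>

```python
# Illustrative sketch (Python, not ObjectServer SQL) of the trigger's
# delta-and-divide calculation. 'totals' mimics the running counters
# read from master.activity_stats for the 'alerts' database.
def event_rate(prev_count, prev_time, totals, now):
    """Return (new_count, delta, rate_per_sec) for one trigger run."""
    count = (totals["StatusNewInserts"] + totals["StatusDedups"]
             + totals["JournalInserts"] + totals["DetailsInserts"])
    delta = count - prev_count
    elapsed = now - prev_time              # seconds since the last run
    rate = delta / elapsed if elapsed > 0 else 0.0
    return count, delta, rate

# 300 new entries over a 60-second interval gives 5 events per second
totals = {"StatusNewInserts": 900, "StatusDedups": 350,
          "JournalInserts": 40, "DetailsInserts": 10}
count, delta, rate = event_rate(1000, 0, totals, 60)
```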
<p dir="ltr">
<a href="http://tivoli-ug.org/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-02-61/3414.event_5F00_rate_2D00_2.png"><img alt=" " border="0" src="http://tivoli-ug.org/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-02-61/3414.event_5F00_rate_2D00_2.png" /></a></p>
<p dir="ltr">
We can run this trigger for a while and determine what the baseline event rate is, and then we can modify the trigger with a threshold statement to insert an event in alerts.status to report that the event rate is significantly higher than normal.</p>
<p dir="ltr">
Alternatively we can record the delta EventCount and use that as our trigger for alerting our techs. The choice will depend on the baseline normal performance of our network.</p>
<p dir="ltr">
<a href="http://tivoli-ug.org/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-02-61/3225.event_5F00_rate_2D00_3.png"><img alt=" " border="0" src="http://tivoli-ug.org/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-02-61/3225.event_5F00_rate_2D00_3.png" /></a></p>
<p dir="ltr">
Once there is an event, it can be used to trigger emails and SMS messages, and with WebGUI 7.4 the email could include a QR code which the tech could use to access a smartphone-enabled alarm display. But that would be something for another blog.</p>
<p dir="ltr">
Consideration would also have to be given to clearing these event rate alarms. They shouldn&#39;t be automatically cleared as soon as the event rate subsides to normal; a suitable period of hysteresis is required. But again, strategies for clearing and deleting old events may be something for a later blog.</p>
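<p dir="ltr">
One possible hysteresis scheme, sketched in Python purely for illustration (the threshold and check count here are made-up example parameters, not values from the automation): only clear the alarm once the rate has stayed at or below the threshold for several consecutive runs.</p>

```python
# Illustrative hysteresis sketch (not part of the OMNIbus automation).
# THRESHOLD and QUIET_CHECKS are made-up example parameters.
THRESHOLD = 1.2     # events/sec level that raised the alarm
QUIET_CHECKS = 5    # consecutive quiet runs required before clearing

def should_clear(rates, threshold=THRESHOLD, quiet_checks=QUIET_CHECKS):
    """rates: measured event rates, most recent last."""
    recent = rates[-quiet_checks:]
    return len(recent) == quiet_checks and all(r <= threshold for r in recent)
```

A single quiet sample after a storm would not clear the alarm; five quiet samples in a row would.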
<p dir="ltr">
For reference, the automation is given below. It should be included in the stats_triggers group and that group needs to be enabled as this new automation references some of the other automations.</p>
<p dir="ltr">
<em>declare<br />
elapsed time;<br />
firstRun boolean;<br />
eventCount integer;<br />
deltaEventCount integer;<br />
eventRate integer;<br />
elapsedSec integer;<br />
begin<br />
&nbsp;&nbsp;&nbsp; set firstRun = true;<br />
&nbsp;&nbsp;&nbsp; for each row mrow in master.event_rate -- this is our table<br />
&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; set firstRun = false; -- if there are no rows in master.event_rate, this will not happen<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; set elapsed = getdate() - mrow.LastUpdate; -- time since the last run (by default this will usually be 60 seconds)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; set eventCount = 0;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; set deltaEventCount = 0;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for each row srow in master.activity_stats where srow.DatabaseName = &#39;alerts&#39;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; set eventCount = srow.StatusNewInserts + srow.StatusDedups + srow.JournalInserts + srow.DetailsInserts; -- add up all the entries going into the alerts database<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; set deltaEventCount = eventCount - mrow.EventCount;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; set eventRate = (eventCount - mrow.EventCount) / elapsed;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; set elapsedSec = elapsed;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -- record the new totals in our single row (primary key &#39;alerts&#39;)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; update master.event_rate via &#39;alerts&#39; set DeltaEventCount = to_real(deltaEventCount), EventRate = to_real(eventRate), EventCount = eventCount, LastUpdate = getdate();<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -- create an event if deltaEventCount exceeds the threshold; the eventRate test<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -- catches false positives caused by the trigger not running for a period<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if ( deltaEventCount &gt; 100 and eventRate &gt; 1.2 ) then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; insert into alerts.status (<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Identifier,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Node,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; AlertGroup,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; AlertKey,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Summary,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Severity,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Type,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Class,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ExpireTime<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ) values (<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#39;alerts_stats_eventrate_threshold_&#39; + to_char(eventCount),<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#39;stats_eventrate&#39;,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#39;threshold event&#39;,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; to_char(deltaEventCount),<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &#39;Warning: &#39; + to_char(deltaEventCount) + &#39; alarms were received in the last &#39; + to_char(elapsedSec) + &#39; seconds&#39;,<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5, -- Severity: critical<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1, -- Type: problem<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0, -- Class<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3600 -- ExpireTime: let the standard expire automation clear it after an hour<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; );<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end if;<br />
&nbsp;&nbsp;&nbsp; end;<br />
&nbsp;&nbsp;&nbsp; if (firstRun = true) then<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -- this will only run when our table is empty (first run)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; set eventCount = 0;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for each row srow in master.activity_stats where srow.DatabaseName = &#39;alerts&#39;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; begin<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; set eventCount = srow.StatusNewInserts + srow.StatusDedups + srow.JournalInserts + srow.DetailsInserts;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; end;<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; insert into master.event_rate (DatabaseName, EventCount, DeltaEventCount, LastUpdate, EventRate) values (&#39;alerts&#39;, eventCount, 0.0, getdate(), 0.0); -- first event rate will be 0.0<br />
&nbsp;&nbsp;&nbsp; end if;<br />
end</em></p>
<p dir="ltr">
&nbsp;</p>
<h3 dir="ltr">
What&#39;s in a name - correlation by exploiting naming conventions (2013-04-23)</h3>
<p dir="ltr">
In my blog last time I looked at correlating events under synthetic container events. That is useful when the potential root cause of an incident doesn&#39;t have an alarm associated with it. A more common form of correlation is to exploit a hierarchy in the topology, and ITNM and OMNIbus do that very effectively. The Netcool Knowledge Library (NcKL) ships with a set of automations that exploit known relationships between devices, the interfaces they have and the virtual circuits over those interfaces. However, if we understand how these correlation automations work we can extend their use beyond the domain of IP networks and SNMP management.</p>
<p dir="ltr">
Naming conventions are a familiar feature in IT, so familiar we are probably unaware of those conventions half the time. But consider the URL of a web page - that follows a naming convention. When networks were more hierarchical network managers attempted to make sense of them by deploying naming conventions and when the standards were developed for mobile networking - GSM and 3G - network naming conventions were standardised. On these networks devices were given what was called a Distinguished Name, which was a description that named the device uniquely (so a network wouldn&#39;t have 200 instances of &quot;Router1&quot;) and named it so its place in the network hierarchy was defined. An example of a DN for a cellular radio transmitter might be:</p>
<p dir="ltr" style="text-align: center;">
<strong>GSM/BSC2119/BCF145/BTS3/TRX2</strong></p>
<p dir="ltr">
A network operator would read that as being the second transmitter (TRX2) on a cell site (BTS3) attached via a circuit group (BCF145) to control region 2119 (BSC2119) on the GSM network. The network might have several thousand TRX2s but only one at that precise location, and the DN uniquely identifies it. Enterprises sometimes adopt the same principles, so that a wifi access point might be given the name HEADOFFICE/2NE/AP2 if it&#39;s the second AP serving the north east wing on the second floor of the head office building.</p>
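<p dir="ltr">
The hierarchy embedded in a DN is easy to recover programmatically. As a sketch (in Python here rather than probe rules, purely for illustration), splitting on the &#39;/&#39; delimiter yields every level of containment, and each prefix is a candidate grouping scope for correlation:</p>

```python
# Illustrative sketch: every '/'-delimited prefix of a Distinguished
# Name identifies one level of the containment hierarchy.
def dn_scopes(dn):
    parts = dn.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]

scopes = dn_scopes("GSM/BSC2119/BCF145/BTS3/TRX2")
# ['GSM', 'GSM/BSC2119', 'GSM/BSC2119/BCF145',
#  'GSM/BSC2119/BCF145/BTS3', 'GSM/BSC2119/BCF145/BTS3/TRX2']
```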
<p dir="ltr">
Apart from identifying a node uniquely, why are naming conventions useful in event management? Well in the GSM network example a failure in the fibre backhaul network would take out not just one circuit group and cell site, but several. On the other hand a power failure at a cell site would impact all the transmitters there but not anywhere else. We are back with the single incident - multiple alarms scenario, and we can exploit naming conventions to provide us with basic correlation to group multiple alarms under a single one for incident management.</p>
<h3 dir="ltr">
Netcool Advanced Correlation (AdvCorr)</h3>
<p dir="ltr">
Netcool AdvCorr integrates the rules file standardisation of the Netcool Knowledge Library with the topology-based event correlation of Network Manager. It is a related-events correlation methodology, so it requires a means of identifying which events could be related. There are two stages to the process:</p>
<ul dir="ltr">
<li>
pre-classification of events into whether they <strong>could</strong> be a root cause or whether they are <strong>always</strong> symptoms of another root cause</li>
<li>
identification of alarming objects into hierarchical classifications of Primary Objects, Root Objects and Secondary Objects</li>
</ul>
<p dir="ltr">
This is achieved in the probe rules. When probe rules files provided by NcKL are used then these classifications are already taken care of, but for rules outside the NcKL framework we need to provide that ourselves.</p>
<h3 dir="ltr">
Worked Example</h3>
<p dir="ltr">
This example uses a captured set of alarms from a real GSM cellular network, suitably anonymised of course. As the original source of alarms was a Nokia EMS, the rules file to be modified is one developed for Nokia NetAct. The first thing we need to do is to create a pre-classification file. This is where different alarm types are classified into root causes and symptoms. Fortunately the Nokia EMS, like most telco EMS, gives each alarm type a unique number so here the pre-classification can be done using a look-up table. (This file is in the zip archive linked to at the end of this blog).</p>
<p dir="ltr">
Pre-classification requires some domain knowledge, but not too much. Enough knowledge to understand that a device not responding to a ping is a symptom rather than a cause, and that a Link Down alarm may be a cause but it might also be the symptom of a higher order failure.</p>
<p dir="ltr">
The second thing that needs to be done in the rules file is to extract the Root, Primary and Secondary Objects from the Distinguished name. Since the DN has defined delimiters - in this case &#39;/&#39; - it&#39;s straightforward to do this using an extract statement in the rules file (A sample rules file is in the zip archive). The three fields that need to be populated are:</p>
<p dir="ltr" style="margin-left: 40px;">
@LocalPriObj - which should be the full DN</p>
<p dir="ltr" style="margin-left: 40px;">
@LocalRootObj - which in this case is the circuit group</p>
<p dir="ltr" style="margin-left: 40px;">
@LocalSecObj - which is an intermediate extract between Primary and Root if one is possible</p>
<p dir="ltr">
The rules that extract the Primary and Secondary Objects must also define their relationship to the Root object and to each other. Permitted values are Same (1), Alias (2) and Parent (3). Since an Object has multiple relationships, the rules multiply the Root to Secondary relationship by 4 and the Secondary to Primary relationship by 16 and add them together. This packs each relationship into its own bit positions in the LocalObjRelate field, so every combination of relationships has a unique value.</p>
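<p dir="ltr">
That packing arithmetic can be shown with a small Python sketch (illustrative only; the argument names are inferred from the description above, with the unmultiplied code taken to be the Primary to Root relationship):</p>

```python
# Illustrative sketch of the LocalObjRelate packing described above.
# Relationship codes: Same = 1, Alias = 2, Parent = 3. Each code fits
# in two bits, so multiplying by 4 and 16 shifts the second and third
# relationships into their own bit positions, making every
# combination of the three relationships a unique value.
SAME, ALIAS, PARENT = 1, 2, 3

def local_obj_relate(pri_to_root, root_to_sec, sec_to_pri):
    return pri_to_root + 4 * root_to_sec + 16 * sec_to_pri

value = local_obj_relate(PARENT, ALIAS, SAME)   # 3 + 8 + 16 = 27
```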
<p dir="ltr">
The final thing that needs to be done to the rules file is to incorporate (or include) the AdvCorr36.include.compat.rules which come as part of the NcKL archive.</p>
<h3 dir="ltr">
AdvCorr in operation</h3>
<p dir="ltr">
There are three automations in the AdvCorr group. The first, AdvCorr_SetCauseType, ensures compatibility with ITNM&#39;s RCA and sets the potential Cause Type to the same value as specified in the rulesfile lookup, unless there is a competing cause type set by ITNM. The other two automations populate root cause and symptom candidate tables which are then used in an iterative fashion to perform the containerisation.</p>
<p dir="ltr">
It can all seem complicated but these automations have been available since OMNIbus 3.6 and simply work.</p>
<p dir="ltr">
What is new though, and complements these automations very well, is the new Event Viewer in WebGUI 7.4. The relationships between symptoms and root causes can be set up as a new Relationship Definition, and the Event Viewer configured to group these relationships, with twisties available to expand and collapse the symptom events.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/advcorr-eventview.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/advcorr-eventview.png" style=" width:100%; display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
&nbsp;</p>
<p dir="ltr">
Like the synthetic event container automation covered in the last blog, these automations offer a way to reduce the number of alarms operators have to view without losing the detail needed to diagnose problems and assess impact.</p>
<p dir="ltr">
A zip file containing sample files can be found <a href="https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=cdd16df5-7bb8-4ef1-bcb9-cefb1dd40581#fullpageWidgetId=W05de62601548_4e85_8940_81bb58657a85&amp;file=22bffd65-ee68-4b07-ac89-c15367d30253">here</a>. The zip file also contains a Word doc providing a fuller description.</p>
<h3 dir="ltr">
Creating Event Containers - RCA when there is no root cause event (2013-04-16)</h3>
<p dir="ltr">
<i>Netcool OMNIbus is in use in more than 2000 enterprises and service providers worldwide. The business issues that OMNIbus addresses will be different, at least in detail, across such a range of users, so it&#39;s quite likely that a business issue which OMNIbus could address has already been seen elsewhere and a solution put together. OMNIbus and the other Netcool products are customisable to a high degree and tools and automations can be, and have been, written by partners and customers as well as by IBM&#39;s own Best Practice and development teams. Within IBM there are OMNIbus experts whose experience in implementing Netcool solutions goes back some 15 years and covers a wide range of industries, and on the Network Management and Service Assurance blog we will gather and post some of that accumulated knowledge and experience in the hope that it may prove useful to readers.</i></p>
<p dir="ltr">
<i>If we can assist in the implementation of a solution by providing an install script or rulesfile modifications then they will be posted on Service Management Connect and a link will be provided from the blogs.</i></p>
<p dir="ltr">
<b>Creating Event Containers</b></p>
<p dir="ltr">
We&#39;ve all seen the scenario where something happens: there are a number of alarms, but not actually one from the root cause. For example, in a data centre a power breaker trips and as a result a dozen or two alarms from servers and UPS boxes hit the Active Event List, but generally there is no alarm from the actual circuit breaker reporting the power trip. This means that our root cause analysis doesn&#39;t actually have the root cause event to pin the fault to, which can be a problem if our procedures require such a root cause event to source the trouble ticket. That&#39;s a simple example that might apply to an enterprise, but as we move towards cloud computing or start to manage more and more &quot;things&quot; it&#39;s likely we will more frequently encounter situations where the root cause is in a domain not visible to us. In those situations it would be useful to be able to synthetically create a container event which can be used as the root cause in problem management processes.</p>
<p dir="ltr">
The scenario we initially looked at was a smart grid one. Smart meters can alarm on loss of power, and that is an important indication of service loss. However, since a power failure is likely to hit an entire street, many smart meters will report this but usually the local substation will not be instrumented to generate an alarm; and even if it were, in the fragmented world of power distribution ordained by the authorities to encourage competition, substation and customers may be owned and managed by different companies. So what we wanted was an event that could be used as the container event for multiple smart meter events; it would be the one displayed in the higher level UIs and used to launch problem management processes. An automation was written that would:</p>
<ul dir="ltr">
<li>
react when two or more smart meters with the same postcode (zip code) reported the same alarm</li>
<li>
if it is known that only one meter is installed in a postcode, create a container alarm for that single meter</li>
<li>
create a synthetic alarm with the postcode as the node name, recording the number of alarms covered by the container</li>
<li>
record the serial of the synthetic alarm in the reported alarms</li>
<li>
if Service Request Manager responds to a trouble ticket creation by inserting a TT number in the synthetic alarm, cascade that number down to the relevant reported alarms</li>
<li>
maintain currency of the synthetic alarm as new alarms come in or existing ones are cleared</li>
<li>
clear the synthetic alarm when all or all but one of the reported alarms in the container are cleared</li>
</ul>
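<p dir="ltr">
The grouping at the heart of those steps can be sketched in Python (illustrative only; the field names below are hypothetical stand-ins, since the real automation is an ObjectServer SQL trigger working on fields such as @Class and @Location in alerts.status):</p>

```python
# Illustrative sketch of the container logic: group meter alarms by
# (postcode, alarm summary) and raise a synthetic container once two
# or more meters in the same postcode report the same alarm. Field
# names here are hypothetical stand-ins for the alerts.status fields.
from collections import defaultdict

def build_containers(alarms, min_meters=2):
    by_scope = defaultdict(list)
    for alarm in alarms:
        by_scope[(alarm["postcode"], alarm["summary"])].append(alarm)
    return [{"node": postcode, "summary": summary, "tally": len(group)}
            for (postcode, summary), group in by_scope.items()
            if len(group) >= min_meters]

alarms = [{"postcode": "AB1 2CD", "meter": "m1", "summary": "power loss"},
          {"postcode": "AB1 2CD", "meter": "m2", "summary": "power loss"},
          {"postcode": "XY9 8ZW", "meter": "m3", "summary": "power loss"}]
containers = build_containers(alarms)   # one container covering m1 and m2
```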
<p dir="ltr">
Although the scenario here is one of smart meters, the automation uses the contents of standard fields such as @Class and @Location. This means that by modifying the filter set by @Class to look at a different range of alarms, the automation can work for other alarm and equipment types.</p>
<p dir="ltr">
The result can be seen in this screen shot.</p>
<p dir="ltr">
<a href="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/alarmwindowscreenshot.png" style="width: 100%; display: inline-block;" target="_blank"><img alt="image" src="https://www.ibm.com/developerworks/community/blogs/More_from_OMNIbus/resource/BLOGS_UPLOADED_IMAGES/alarmwindowscreenshot.png" style=" display:block; margin: 1em 0pt 0pt 0pt; float: left;" /></a></p>
<p dir="ltr">
The district alarm panel contains five synthetic alarms which summarise 42 individual meter alarms, and as shown, a work order number is cascaded down to the individual alarms covered.</p>
<p dir="ltr">
The automation and its install script are posted <a href="https://www.ibm.com/developerworks/mydeveloperworks/groups/service/html/communityview?communityUuid=cdd16df5-7bb8-4ef1-bcb9-cefb1dd40581#fullpageWidgetId=W05de62601548_4e85_8940_81bb58657a85&amp;file=c5acd320-32af-4aea-8ebd-ee509b38b167">here</a>, along with a capture file and the necessary rules and lookups for a stdin probe, so that the automation can be trialled and/or demoed.</p>
<p dir="ltr">
&nbsp;</p>